The source from which I obtained this data is: https://www.kaggle.com/datasets/shivachandel/kc-house-data
Goal: To analyze the housing data set, understand the trends and patterns in the housing market, and build a model that accurately estimates a house's price from its features. Such an analysis helps potential buyers and sellers make informed decisions and aids real estate agents in advising their clients. Specifically, we aim to identify the key factors that influence house prices, such as location, size, and condition, and to develop a price-prediction model with a high degree of accuracy. The data will also be explored to understand geographical trends, identify outliers and anomalies, and evaluate potential bias in the data set. Finally, the insights gained from the analysis will be presented clearly and concisely using visualizations and descriptive statistics.
Introduction: This data set contains information on residential properties sold between May 2014 and May 2015 in King County, Washington state, USA. The data includes details such as the price, number of bedrooms and bathrooms, square footage of living space and lot size, number of floors, whether the property has a waterfront view or not, and other features, such as the year built and year renovated. The data set contains 21,613 observations and 21 attributes. The goal of this data set is to provide insights into the factors that affect the price of a house and to build a predictive model that can accurately estimate the price of a house based on its features. This data set can be used by real estate agents, buyers, and sellers to make informed decisions and to gain a better understanding of the housing market in King County. The data set can also be used by researchers and data scientists to explore and analyze trends and patterns in the housing market. The data set is publicly available and can be downloaded from various online sources.
Objectives:
#Import necessary libraries for basic data processing
import math
import time
import numpy as np
import pandas as pd
from uszipcode import SearchEngine
#Import libraries for visualization
import seaborn as sns
import matplotlib.pyplot as plt
#Import libraries for modeling
#Convert categorical data into numerical data
from sklearn.preprocessing import LabelEncoder
#Split data into training and testing sets
from sklearn.model_selection import train_test_split
#Scale data
from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
#Import regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
#Import classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
#Import evaluation metrics
from sklearn.metrics import (classification_report, r2_score, mean_squared_error, accuracy_score,
                             confusion_matrix, roc_curve, roc_auc_score, auc, precision_recall_curve)
#Import scipy for statistical tests
from scipy import stats
#Import libraries for model deployment
import joblib
import pickle
#Suppress warnings
import warnings
warnings.filterwarnings("ignore")
#load train dataset
data= pd.read_csv('kc_house_data.csv')
#show data
data
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180.0 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170.0 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770.0 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050.0 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680.0 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 263000018 | 20140521T000000 | 360000.0 | 3 | 2.50 | 1530 | 1131 | 3.0 | 0 | 0 | ... | 8 | 1530.0 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21609 | 6600060120 | 20150223T000000 | 400000.0 | 4 | 2.50 | 2310 | 5813 | 2.0 | 0 | 0 | ... | 8 | 2310.0 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21610 | 1523300141 | 20140623T000000 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | ... | 7 | 1020.0 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21611 | 291310100 | 20150116T000000 | 400000.0 | 3 | 2.50 | 1600 | 2388 | 2.0 | 0 | 0 | ... | 8 | 1600.0 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |
| 21612 | 1523300157 | 20141015T000000 | 325000.0 | 2 | 0.75 | 1020 | 1076 | 2.0 | 0 | 0 | ... | 7 | 1020.0 | 0 | 2008 | 0 | 98144 | 47.5941 | -122.299 | 1020 | 1357 |
21613 rows × 21 columns
Read the data and display it.
"id" : A unique identifier for each record in the dataset.
"date": The date on which the property was sold.
'price': The price of the property in USD.
'bedrooms': The number of bedrooms in the property.
'bathrooms': The number of bathrooms in the property.
'sqft_living': The size of the property's living space in square feet.
'sqft_lot': The size of the property's lot in square feet.
'floors': The number of floors in the property.
'waterfront': A binary variable indicating whether the property is located on a waterfront or not.
'view': A rating of the property's view from 0 to 4.
'condition': A rating of the property's condition from 1 to 5.
'grade': A rating of the property's overall grade from 1 to 13.
'sqft_above': The size of the property's living space above ground level in square feet.
'sqft_basement': The size of the property's living space below ground level in square feet.
'yr_built': The year in which the property was built.
'yr_renovated': The year in which the property was last renovated.
'zipcode': The zipcode of the area in which the property is located.
'lat': The latitude coordinate of the property's location.
'long': The longitude coordinate of the property's location.
'sqft_living15': The average size of nearby houses' living space in square feet.
'sqft_lot15': The average size of nearby houses' lots in square feet.
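As a first transformation, the `date` strings (e.g. `20141013T000000`) can be parsed into proper datetimes so that sale year and month become available for trend analysis. A minimal sketch on a toy frame standing in for the full dataset:

```python
import pandas as pd

# Toy frame with the same 'date' format as kc_house_data.csv
df = pd.DataFrame({"date": ["20141013T000000", "20150225T000000"],
                   "price": [221900.0, 180000.0]})

# Parse the ISO-like strings once, then derive sale year and month
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")
df["sale_year"] = df["date"].dt.year
df["sale_month"] = df["date"].dt.month
```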
First, we run a quick scan of the dataset to assess its overall quality and get a sense of what the first processing steps should be.
# Print the number of rows and columns in the dataset
print(f'The dataset has {data.shape[0]} rows and {data.shape[1]} columns\n')
# Print a separator line
print('- -' * 30)
# Print value counts for each column in the dataset
print('Value counts for each column: \n')
for i in data.columns:
    # Print the name of the column
    print(f'===== {i} =====\n')
    # Print the value counts for each unique value in the column, sorted in descending order
    print(data[i].value_counts().sort_values(ascending=False))
    # Print a separator line between each column's value counts
    print('--' * 30)
The dataset has 21613 rows and 21 columns
- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
Value counts for each column:
===== id =====
795000620 3
5282200015 2
8832900780 2
526059224 2
5101405604 2
..
993001976 1
525049174 1
4187000190 1
6056110780 1
1523300157 1
Name: id, Length: 21436, dtype: int64
------------------------------------------------------------
===== date =====
20140623T000000 142
20140625T000000 131
20140626T000000 131
20140708T000000 127
20150427T000000 126
...
20141130T000000 1
20140803T000000 1
20150527T000000 1
20150110T000000 1
20140727T000000 1
Name: date, Length: 372, dtype: int64
------------------------------------------------------------
===== price =====
350000.0 172
450000.0 172
550000.0 159
500000.0 152
425000.0 150
...
514700.0 1
388598.0 1
471275.0 1
521500.0 1
402101.0 1
Name: price, Length: 4028, dtype: int64
------------------------------------------------------------
===== bedrooms =====
3 9824
4 6882
2 2760
5 1601
6 272
1 199
7 38
0 13
8 13
9 6
10 3
11 1
33 1
Name: bedrooms, dtype: int64
------------------------------------------------------------
===== bathrooms =====
2.50 5380
1.00 3852
1.75 3048
2.25 2047
2.00 1930
1.50 1446
2.75 1185
3.00 753
3.50 731
3.25 589
3.75 155
4.00 136
4.50 100
4.25 79
0.75 72
4.75 23
5.00 21
5.25 13
0.00 10
5.50 10
1.25 9
6.00 6
0.50 4
5.75 4
6.75 2
8.00 2
6.25 2
6.50 2
7.50 1
7.75 1
Name: bathrooms, dtype: int64
------------------------------------------------------------
===== sqft_living =====
1300 138
1400 135
1440 133
1660 129
1010 129
...
2478 1
1496 1
3402 1
1061 1
1425 1
Name: sqft_living, Length: 1038, dtype: int64
------------------------------------------------------------
===== sqft_lot =====
5000 358
6000 290
4000 251
7200 220
4800 120
...
914 1
4396 1
1449 1
1902 1
1076 1
Name: sqft_lot, Length: 9782, dtype: int64
------------------------------------------------------------
===== floors =====
1.0 10680
2.0 8241
1.5 1910
3.0 613
2.5 161
3.5 8
Name: floors, dtype: int64
------------------------------------------------------------
===== waterfront =====
0 21450
1 163
Name: waterfront, dtype: int64
------------------------------------------------------------
===== view =====
0 19489
2 963
3 510
1 332
4 319
Name: view, dtype: int64
------------------------------------------------------------
===== condition =====
3 14031
4 5679
5 1701
2 172
1 30
Name: condition, dtype: int64
------------------------------------------------------------
===== grade =====
7 8981
8 6068
9 2615
6 2038
10 1134
11 399
5 242
12 90
4 29
13 13
3 3
1 1
Name: grade, dtype: int64
------------------------------------------------------------
===== sqft_above =====
1300.0 212
1010.0 210
1200.0 206
1220.0 192
1140.0 184
...
2864.0 1
2716.0 1
1572.0 1
3281.0 1
1425.0 1
Name: sqft_above, Length: 946, dtype: int64
------------------------------------------------------------
===== sqft_basement =====
0 13126
600 221
700 218
500 214
800 206
...
2180 1
225 1
276 1
1248 1
248 1
Name: sqft_basement, Length: 306, dtype: int64
------------------------------------------------------------
===== yr_built =====
2014 559
2006 454
2005 450
2004 433
2003 422
...
1933 30
1901 29
1902 27
1935 24
1934 21
Name: yr_built, Length: 116, dtype: int64
------------------------------------------------------------
===== yr_renovated =====
0 20699
2014 91
2013 37
2003 36
2005 35
...
1951 1
1959 1
1948 1
1954 1
1944 1
Name: yr_renovated, Length: 70, dtype: int64
------------------------------------------------------------
===== zipcode =====
98103 602
98038 590
98115 583
98052 574
98117 553
...
98102 105
98010 100
98024 81
98148 57
98039 50
Name: zipcode, Length: 70, dtype: int64
------------------------------------------------------------
===== lat =====
47.6624 17
47.6846 17
47.5491 17
47.5322 17
47.6955 16
..
47.2920 1
47.3698 1
47.2839 1
47.2995 1
47.6502 1
Name: lat, Length: 5034, dtype: int64
------------------------------------------------------------
===== long =====
-122.290 116
-122.300 111
-122.362 104
-122.291 100
-122.363 99
...
-122.447 1
-121.797 1
-122.491 1
-121.837 1
-121.403 1
Name: long, Length: 752, dtype: int64
------------------------------------------------------------
===== sqft_living15 =====
1540 197
1440 195
1560 192
1500 181
1460 169
...
2238 1
2616 1
1427 1
2456 1
2927 1
Name: sqft_living15, Length: 777, dtype: int64
------------------------------------------------------------
===== sqft_lot15 =====
5000 427
4000 357
6000 289
7200 211
4800 145
...
6801 1
9937 1
26027 1
4795 1
2007 1
Name: sqft_lot15, Length: 8689, dtype: int64
------------------------------------------------------------
This code performs a quick exploratory scan of the dataset.
The first print reports the number of rows and columns using the DataFrame's shape attribute.
The next print draws a separator line to visually divide the output.
The loop then iterates over every column in the dataset.
For each column it prints the column name, followed by the counts of each unique value, sorted in descending order via the Series value_counts() method.
A separator line is printed between each column's value counts.
Overall, this gives a quick overview of the dataset: its dimensions and the most common values in each attribute.
# Show the column names, dtypes, and length of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 21613 non-null int64 1 date 21613 non-null object 2 price 21613 non-null float64 3 bedrooms 21613 non-null int64 4 bathrooms 21613 non-null float64 5 sqft_living 21613 non-null int64 6 sqft_lot 21613 non-null int64 7 floors 21613 non-null float64 8 waterfront 21613 non-null int64 9 view 21613 non-null int64 10 condition 21613 non-null int64 11 grade 21613 non-null int64 12 sqft_above 21611 non-null float64 13 sqft_basement 21613 non-null int64 14 yr_built 21613 non-null int64 15 yr_renovated 21613 non-null int64 16 zipcode 21613 non-null int64 17 lat 21613 non-null float64 18 long 21613 non-null float64 19 sqft_living15 21613 non-null int64 20 sqft_lot15 21613 non-null int64 dtypes: float64(6), int64(14), object(1) memory usage: 3.5+ MB
This code prints a summary of the dataset, including the column names, data types, and number of non-null values in each column.
# Statistical summary of the numerical variables in the dataset
data.describe()
| id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.161300e+04 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21611.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| mean | 4.580302e+09 | 5.400881e+05 | 3.370842 | 2.114757 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | 3.409430 | 7.656873 | 1788.396095 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 2.876566e+09 | 3.671272e+05 | 0.930062 | 0.770163 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.128162 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 1.000102e+06 | 7.500000e+04 | 0.000000 | 0.000000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 2.123049e+09 | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 3.904930e+09 | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 7.308900e+09 | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 9.900000e+09 | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
Describe the numerical columns of the dataset.
# Statistical summary of the numerical variables, cast to integer
data.describe().astype(int)
| id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21611 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 | 21613 |
| mean | -2147483648 | 540088 | 3 | 2 | 2079 | 15106 | 1 | 0 | 0 | 3 | 7 | 1788 | 291 | 1971 | 84 | 98077 | 47 | -122 | 1986 | 12768 |
| std | -2147483648 | 367127 | 0 | 0 | 918 | 41420 | 0 | 0 | 0 | 0 | 1 | 828 | 442 | 29 | 401 | 53 | 0 | 0 | 685 | 27304 |
| min | 1000102 | 75000 | 0 | 0 | 290 | 520 | 1 | 0 | 0 | 1 | 1 | 290 | 0 | 1900 | 0 | 98001 | 47 | -122 | 399 | 651 |
| 25% | 2123049194 | 321950 | 3 | 1 | 1427 | 5040 | 1 | 0 | 0 | 3 | 7 | 1190 | 0 | 1951 | 0 | 98033 | 47 | -122 | 1490 | 5100 |
| 50% | -2147483648 | 450000 | 3 | 2 | 1910 | 7618 | 1 | 0 | 0 | 3 | 7 | 1560 | 0 | 1975 | 0 | 98065 | 47 | -122 | 1840 | 7620 |
| 75% | -2147483648 | 645000 | 4 | 2 | 2550 | 10688 | 2 | 0 | 0 | 4 | 8 | 2210 | 560 | 1997 | 0 | 98118 | 47 | -122 | 2360 | 10083 |
| max | -2147483648 | 7700000 | 33 | 8 | 13540 | 1651359 | 3 | 1 | 4 | 5 | 13 | 9410 | 4820 | 2015 | 2015 | 98199 | 47 | -121 | 6210 | 871200 |
Describe the numerical data cast to integer. Note that several id statistics display as -2147483648, a sign that the integer cast overflowed.
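The -2147483648 entries in the table above are a 32-bit overflow artifact: on Windows builds of NumPy, `astype(int)` resolves to int32, whose maximum is 2,147,483,647, well below the ~4.6e9 id statistics. A small sketch of the safe 64-bit cast, using toy stand-in values rather than the real columns:

```python
import pandas as pd

# Toy stand-ins for the mean id (~4.58e9) and mean price
s = pd.Series([4.58e9, 540088.1])

# astype(int) may resolve to int32 and wrap around; an explicit
# 64-bit cast preserves values up to 2**63 - 1
safe = s.astype("int64")
```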
# Statistical summary of all variables in the dataset
data.describe(include='all')
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.161300e+04 | 21613 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | ... | 21613.000000 | 21611.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| unique | NaN | 372 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | 20140623T000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 142 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 4.580302e+09 | NaN | 5.400881e+05 | 3.370842 | 2.114757 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | ... | 7.656873 | 1788.396095 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 2.876566e+09 | NaN | 3.671272e+05 | 0.930062 | 0.770163 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | ... | 1.175459 | 828.128162 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 1.000102e+06 | NaN | 7.500000e+04 | 0.000000 | 0.000000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 2.123049e+09 | NaN | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | ... | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 3.904930e+09 | NaN | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | ... | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 7.308900e+09 | NaN | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | ... | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 9.900000e+09 | NaN | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | ... | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
11 rows × 21 columns
Describe both the numerical and the categorical columns; include='all' adds unique, top, and freq statistics for the non-numeric date column.
# Count the missing values in each column
data.isnull().sum()
id 0 date 0 price 0 bedrooms 0 bathrooms 0 sqft_living 0 sqft_lot 0 floors 0 waterfront 0 view 0 condition 0 grade 0 sqft_above 2 sqft_basement 0 yr_built 0 yr_renovated 0 zipcode 0 lat 0 long 0 sqft_living15 0 sqft_lot15 0 dtype: int64
Check for missing values in each column.
# Checking for rows with a price of 0
data.query("price == 0")
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|
0 rows × 21 columns
Confirm that no rows have a price of 0.
#Change the display format to standardize the describe() output
pd.set_option('display.float_format', lambda x: '%.5f' % x) # Set 5 decimals to eliminate numerical notation
data.describe()
| id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21611.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 | 21613.00000 |
| mean | 4580301520.86499 | 540088.14177 | 3.37084 | 2.11476 | 2079.89974 | 15106.96757 | 1.49431 | 0.00754 | 0.23430 | 3.40943 | 7.65687 | 1788.39609 | 291.50905 | 1971.00514 | 84.40226 | 98077.93980 | 47.56005 | -122.21390 | 1986.55249 | 12768.45565 |
| std | 2876565571.31205 | 367127.19648 | 0.93006 | 0.77016 | 918.44090 | 41420.51152 | 0.53999 | 0.08652 | 0.76632 | 0.65074 | 1.17546 | 828.12816 | 442.57504 | 29.37341 | 401.67924 | 53.50503 | 0.13856 | 0.14083 | 685.39130 | 27304.17963 |
| min | 1000102.00000 | 75000.00000 | 0.00000 | 0.00000 | 290.00000 | 520.00000 | 1.00000 | 0.00000 | 0.00000 | 1.00000 | 1.00000 | 290.00000 | 0.00000 | 1900.00000 | 0.00000 | 98001.00000 | 47.15590 | -122.51900 | 399.00000 | 651.00000 |
| 25% | 2123049194.00000 | 321950.00000 | 3.00000 | 1.75000 | 1427.00000 | 5040.00000 | 1.00000 | 0.00000 | 0.00000 | 3.00000 | 7.00000 | 1190.00000 | 0.00000 | 1951.00000 | 0.00000 | 98033.00000 | 47.47100 | -122.32800 | 1490.00000 | 5100.00000 |
| 50% | 3904930410.00000 | 450000.00000 | 3.00000 | 2.25000 | 1910.00000 | 7618.00000 | 1.50000 | 0.00000 | 0.00000 | 3.00000 | 7.00000 | 1560.00000 | 0.00000 | 1975.00000 | 0.00000 | 98065.00000 | 47.57180 | -122.23000 | 1840.00000 | 7620.00000 |
| 75% | 7308900445.00000 | 645000.00000 | 4.00000 | 2.50000 | 2550.00000 | 10688.00000 | 2.00000 | 0.00000 | 0.00000 | 4.00000 | 8.00000 | 2210.00000 | 560.00000 | 1997.00000 | 0.00000 | 98118.00000 | 47.67800 | -122.12500 | 2360.00000 | 10083.00000 |
| max | 9900000190.00000 | 7700000.00000 | 33.00000 | 8.00000 | 13540.00000 | 1651359.00000 | 3.50000 | 1.00000 | 4.00000 | 5.00000 | 13.00000 | 9410.00000 | 4820.00000 | 2015.00000 | 2015.00000 | 98199.00000 | 47.77760 | -121.31500 | 6210.00000 | 871200.00000 |
Print the description of the data, formatted to 5 decimal places.
# Check data in null and count it
data.isnull().sum()
id 0 date 0 price 0 bedrooms 0 bathrooms 0 sqft_living 0 sqft_lot 0 floors 0 waterfront 0 view 0 condition 0 grade 0 sqft_above 2 sqft_basement 0 yr_built 0 yr_renovated 0 zipcode 0 lat 0 long 0 sqft_living15 0 sqft_lot15 0 dtype: int64
# Drop rows containing missing values (only the 2 rows missing sqft_above)
data.dropna(inplace=True)
Drop the rows with missing values.
#Check for missing values again after dropping
data.isnull().sum()
id 0 date 0 price 0 bedrooms 0 bathrooms 0 sqft_living 0 sqft_lot 0 floors 0 waterfront 0 view 0 condition 0 grade 0 sqft_above 0 sqft_basement 0 yr_built 0 yr_renovated 0 zipcode 0 lat 0 long 0 sqft_living15 0 sqft_lot15 0 dtype: int64
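Dropping 2 of 21,613 rows is harmless here. An alternative, sketched below on toy data, is to impute the missing sqft_above values with the column median instead (median imputation is an assumption for illustration, not what this notebook does):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sqft_above": [1180.0, np.nan, 770.0]})

# Fill missing sqft_above with the column median instead of dropping rows
df["sqft_above"] = df["sqft_above"].fillna(df["sqft_above"].median())
```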
# Print shape before dropping duplicates
print('Shape of data before removing duplicates',data.shape)
# Drop duplicates and print shape after
data.drop_duplicates(inplace=True)
print('Shape of data after removing duplicates',data.shape)
Shape of data before removing duplicates (21611, 21) Shape of data after removing duplicates (21611, 21)
data.drop_duplicates(inplace=True) drops duplicate rows from the DataFrame in place. The surrounding print calls report data.shape, a (rows, columns) tuple, before and after the drop; both show (21611, 21), so the dataset contains no duplicate rows.
(data[data['yr_renovated']==0].shape)[0]
20697
This selects all rows in the data DataFrame where the yr_renovated column equals 0 and counts them with shape[0]: 20,697 of the 21,611 houses have never been renovated.
(data[data['sqft_living']==data['sqft_living15']].shape)[0]
2566
This selects all rows in the data DataFrame where the sqft_living column is equal to the sqft_living15 column. .shape: This returns a tuple representing the dimensions of the resulting DataFrame, where the first element is the number of rows and the second element is the number of columns. [0]: This selects the first element of the tuple, which corresponds to the number of rows.
(data[data['sqft_lot']==data['sqft_lot15']].shape)[0]
4474
This counts the rows where sqft_lot equals sqft_lot15. The boolean comparison selects the matching rows, .shape returns a (rows, columns) tuple for the filtered DataFrame, and [0] takes the row count. So 4,474 houses have a lot the same size as the average lot of their nearest neighbors.
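The three counts above can be written more directly by summing boolean Series, since each True sums as 1. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"yr_renovated": [0, 1991, 0],
                   "sqft_living": [1180, 2570, 770],
                   "sqft_living15": [1180, 1690, 770]})

# (boolean Series).sum() counts the True values, replacing the
# df[mask].shape[0] pattern used above
never_renovated = int((df["yr_renovated"] == 0).sum())
same_as_neighbors = int((df["sqft_living"] == df["sqft_living15"]).sum())
```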
# Drop irrelevant columns
data.drop(['yr_renovated','id'], axis=1, inplace=True)
#print shape
data.shape
(21611, 19)
data.drop(['yr_renovated', 'id'], axis=1, inplace=True) removes the yr_renovated and id columns in place: axis=1 selects columns rather than rows, and inplace=True modifies the DataFrame instead of returning a copy. data.shape then returns the new dimensions, (21611, 19). These columns are dropped because id is a record identifier with no predictive value and yr_renovated is 0 for roughly 96% of the rows.
The data were collected in the US, so we will use the uszipcode library.
What can we get from it? City, state, county, population, and population density.
# Create an instance of the SearchEngine class
engine = SearchEngine()
# Record the start time
start = time.time()
# Define a function to get the location information for a given zipcode and add it to a row
def get_location(zipcode, data):
    # Use the SearchEngine instance to get the location information for the given zipcode
    location = engine.by_zipcode(zipcode)
    # Add the location information to the row
    data["city"] = location.major_city
    data["state"] = location.state
    data["county"] = location.county
    data["population"] = location.population
    data["population_density"] = location.population_density
    # Return the updated row
    return data
# Apply the get_location function to each row of the DataFrame using the apply method
data = data.apply(lambda x: get_location(x['zipcode'], x), axis=1)
# Record the end time
end = time.time()
# Print the execution time and the updated DataFrame
print(f"The time of execution of above program is :{end-start}\n")
data
The time of execution of above program is :230.04953861236572
| date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | ... | zipcode | lat | long | sqft_living15 | sqft_lot15 | city | state | county | population | population_density | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20141013T000000 | 221900.00000 | 3 | 1.00000 | 1180 | 5650 | 1.00000 | 0 | 0 | 3 | ... | 98178 | 47.51120 | -122.25700 | 1340 | 5650 | Seattle | WA | King County | 24092 | 4966.00000 |
| 1 | 20141209T000000 | 538000.00000 | 3 | 2.25000 | 2570 | 7242 | 2.00000 | 0 | 0 | 3 | ... | 98125 | 47.72100 | -122.31900 | 1690 | 7639 | Seattle | WA | King County | 37081 | 6879.00000 |
| 2 | 20150225T000000 | 180000.00000 | 2 | 1.00000 | 770 | 10000 | 1.00000 | 0 | 0 | 3 | ... | 98028 | 47.73790 | -122.23300 | 2720 | 8062 | Kenmore | WA | King County | 20419 | 3606.00000 |
| 3 | 20141209T000000 | 604000.00000 | 4 | 3.00000 | 1960 | 5000 | 1.00000 | 0 | 0 | 5 | ... | 98136 | 47.52080 | -122.39300 | 1360 | 5000 | Seattle | WA | King County | 14770 | 6425.00000 |
| 4 | 20150218T000000 | 510000.00000 | 3 | 2.00000 | 1680 | 8080 | 1.00000 | 0 | 0 | 3 | ... | 98074 | 47.61680 | -122.04500 | 1800 | 7503 | Sammamish | WA | King County | 25748 | 2411.00000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 20140521T000000 | 360000.00000 | 3 | 2.50000 | 1530 | 1131 | 3.00000 | 0 | 0 | 3 | ... | 98103 | 47.69930 | -122.34600 | 1530 | 1509 | Seattle | WA | King County | 45911 | 9905.00000 |
| 21609 | 20150223T000000 | 400000.00000 | 4 | 2.50000 | 2310 | 5813 | 2.00000 | 0 | 0 | 3 | ... | 98146 | 47.51070 | -122.36200 | 1830 | 7200 | Seattle | WA | King County | 25922 | 5573.00000 |
| 21610 | 20140623T000000 | 402101.00000 | 2 | 0.75000 | 1020 | 1350 | 2.00000 | 0 | 0 | 3 | ... | 98144 | 47.59440 | -122.29900 | 1020 | 2007 | Seattle | WA | King County | 26881 | 7895.00000 |
| 21611 | 20150116T000000 | 400000.00000 | 3 | 2.50000 | 1600 | 2388 | 2.00000 | 0 | 0 | 3 | ... | 98027 | 47.53450 | -122.06900 | 1410 | 1287 | Issaquah | WA | King County | 26141 | 469.00000 |
| 21612 | 20141015T000000 | 325000.00000 | 2 | 0.75000 | 1020 | 1076 | 2.00000 | 0 | 0 | 3 | ... | 98144 | 47.59410 | -122.29900 | 1020 | 1357 | Seattle | WA | King County | 26881 | 7895.00000 |
21611 rows × 24 columns
engine = SearchEngine() creates an instance of the SearchEngine class from the uszipcode library.
start = time.time() records the current time in seconds since the Epoch and assigns it to start.
def get_location(zipcode, data): defines a function that takes a zipcode and a row of the DataFrame as arguments.
location = engine.by_zipcode(zipcode) looks up the location information for the given zipcode.
The next five assignments add city, state, county, population, and population_density columns from the location information.
return data returns the updated row.
data = data.apply(lambda x: get_location(x['zipcode'], x), axis=1) applies get_location to each row of the DataFrame.
end = time.time() records the end time, and the print statement reports the elapsed time (about 230 seconds here, since by_zipcode is called once per row).
Finally, data displays the updated DataFrame with the new location columns.
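The per-row apply above calls engine.by_zipcode once for every row, which is why the cell takes roughly 230 seconds. Since many rows share the same zipcode, a common speed-up is to resolve each unique zipcode once and merge the results back in. A minimal sketch of that pattern, using a hypothetical zip_info dict as a stand-in for the uszipcode lookup so it runs standalone:

```python
import pandas as pd

# Hypothetical stand-in for SearchEngine().by_zipcode(...), used here so
# the sketch runs without the uszipcode package installed.
zip_info = {
    98178: {"city": "Seattle", "population": 24092},
    98028: {"city": "Kenmore", "population": 20419},
}

data = pd.DataFrame({"zipcode": [98178, 98028, 98178],
                     "price": [221900, 180000, 325000]})

# Resolve each unique zipcode once, then merge the results back in;
# this is far cheaper than calling the lookup for every row with apply().
lookup = pd.DataFrame.from_dict(zip_info, orient="index")
lookup = lookup.rename_axis("zipcode").reset_index()
data = data.merge(lookup, on="zipcode", how="left")
print(data["city"].tolist())  # ['Seattle', 'Kenmore', 'Seattle']
```

With a real SearchEngine instance, the lookup table would be built from data["zipcode"].unique() instead of a hard-coded dict.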
# Convert the date column to a datetime format
data['date'] = pd.to_datetime(data['date'])
# Add a new column for the year of the transaction
data["tr_year"] = data["date"].dt.year
# Add a new column for the month of the transaction
data["tr_month"] = data["date"].dt.month
# Change the date column to a string format with only year and month
data["date"] = data["date"].dt.strftime('%Y-%m')
# Print the updated DataFrame
data
| date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | ... | long | sqft_living15 | sqft_lot15 | city | state | county | population | population_density | tr_year | tr_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-10 | 221900.00000 | 3 | 1.00000 | 1180 | 5650 | 1.00000 | 0 | 0 | 3 | ... | -122.25700 | 1340 | 5650 | Seattle | WA | King County | 24092 | 4966.00000 | 2014 | 10 |
| 1 | 2014-12 | 538000.00000 | 3 | 2.25000 | 2570 | 7242 | 2.00000 | 0 | 0 | 3 | ... | -122.31900 | 1690 | 7639 | Seattle | WA | King County | 37081 | 6879.00000 | 2014 | 12 |
| 2 | 2015-02 | 180000.00000 | 2 | 1.00000 | 770 | 10000 | 1.00000 | 0 | 0 | 3 | ... | -122.23300 | 2720 | 8062 | Kenmore | WA | King County | 20419 | 3606.00000 | 2015 | 2 |
| 3 | 2014-12 | 604000.00000 | 4 | 3.00000 | 1960 | 5000 | 1.00000 | 0 | 0 | 5 | ... | -122.39300 | 1360 | 5000 | Seattle | WA | King County | 14770 | 6425.00000 | 2014 | 12 |
| 4 | 2015-02 | 510000.00000 | 3 | 2.00000 | 1680 | 8080 | 1.00000 | 0 | 0 | 3 | ... | -122.04500 | 1800 | 7503 | Sammamish | WA | King County | 25748 | 2411.00000 | 2015 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 2014-05 | 360000.00000 | 3 | 2.50000 | 1530 | 1131 | 3.00000 | 0 | 0 | 3 | ... | -122.34600 | 1530 | 1509 | Seattle | WA | King County | 45911 | 9905.00000 | 2014 | 5 |
| 21609 | 2015-02 | 400000.00000 | 4 | 2.50000 | 2310 | 5813 | 2.00000 | 0 | 0 | 3 | ... | -122.36200 | 1830 | 7200 | Seattle | WA | King County | 25922 | 5573.00000 | 2015 | 2 |
| 21610 | 2014-06 | 402101.00000 | 2 | 0.75000 | 1020 | 1350 | 2.00000 | 0 | 0 | 3 | ... | -122.29900 | 1020 | 2007 | Seattle | WA | King County | 26881 | 7895.00000 | 2014 | 6 |
| 21611 | 2015-01 | 400000.00000 | 3 | 2.50000 | 1600 | 2388 | 2.00000 | 0 | 0 | 3 | ... | -122.06900 | 1410 | 1287 | Issaquah | WA | King County | 26141 | 469.00000 | 2015 | 1 |
| 21612 | 2014-10 | 325000.00000 | 2 | 0.75000 | 1020 | 1076 | 2.00000 | 0 | 0 | 3 | ... | -122.29900 | 1020 | 1357 | Seattle | WA | King County | 26881 | 7895.00000 | 2014 | 10 |
21611 rows × 26 columns
data['date'] = pd.to_datetime(data['date']) converts the date column of the DataFrame data to a datetime format using the pd.to_datetime() function from the Pandas library.
data["tr_year"] = data["date"].dt.year creates a new column tr_year containing the year of each transaction, obtained via the .dt.year attribute.
data["tr_month"] = data["date"].dt.month creates a new column tr_month containing the month of each transaction, obtained via the .dt.month attribute.
data["date"] = data["date"].dt.strftime('%Y-%m') converts the date column back to a string containing only the year and month, using the .dt.strftime() method with the %Y-%m format.
data displays the updated DataFrame with the new columns and the modified date column.
Overall, the code extracts the year and month of each transaction and reformats the date column to show only year and month.
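The steps above can be sketched on a couple of sample date strings in the dataset's 'YYYYMMDDTHHMMSS' format (illustrative values):

```python
import pandas as pd

s = pd.Series(["20141013T000000", "20150225T000000"])

# Strings in ISO 8601 basic format are parsed by to_datetime directly.
dates = pd.to_datetime(s)
years = dates.dt.year      # 2014, 2015
months = dates.dt.month    # 10, 2
labels = dates.dt.strftime("%Y-%m")
print(labels.tolist())  # ['2014-10', '2015-02']
```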
# Print the max and min transaction year and month
(data[['tr_year','tr_month']].max(),data[['tr_year','tr_month']].min())
(tr_year 2015 tr_month 12 dtype: int64, tr_year 2014 tr_month 1 dtype: int64)
This line returns a tuple containing the column-wise maximum and minimum values of the tr_year and tr_month columns of data. The first element of the tuple holds the maxima, and the second holds the minima. Note that each column is evaluated independently: the maximum month (12) comes from a December 2014 transaction, not December 2015, since the data ends in May 2015.
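Because the columns are evaluated independently, getting the true latest transaction period requires combining year and month first. A small sketch on illustrative values:

```python
import pandas as pd

df = pd.DataFrame({"tr_year": [2014, 2014, 2015],
                   "tr_month": [5, 12, 2]})

# Column-wise max pairs (2015, 12) even though no row is December 2015.
print(df.max().tolist())  # [2015, 12]

# Assemble real dates from the year/month columns, then take the max.
dates = pd.to_datetime({"year": df["tr_year"], "month": df["tr_month"], "day": 1})
print(dates.max().strftime("%Y-%m"))  # 2015-02
```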
# Convert the 'price' column to integer data type
data['price'] = data['price'].astype(int)
# Convert the 'population_density' column to integer data type
data['population_density'] = data['population_density'].astype(int)
The overall code converts the price and population_density columns of data to integer data types, which is useful when performing mathematical operations on these columns.
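One thing to keep in mind: astype(int) truncates the fractional part toward zero rather than rounding, so round first if nearest-integer behavior is wanted. A quick illustration:

```python
import pandas as pd

s = pd.Series([4966.7, 6879.2])

# astype(int) truncates toward zero; round() first for nearest-integer.
print(s.astype(int).tolist())          # [4966, 6879]
print(s.round().astype(int).tolist())  # [4967, 6879]
```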
# Print information about data
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21611 entries, 0 to 21612
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   date                21611 non-null  object
 1   price               21611 non-null  int32
 2   bedrooms            21611 non-null  int64
 3   bathrooms           21611 non-null  float64
 4   sqft_living         21611 non-null  int64
 5   sqft_lot            21611 non-null  int64
 6   floors              21611 non-null  float64
 7   waterfront          21611 non-null  int64
 8   view                21611 non-null  int64
 9   condition           21611 non-null  int64
 10  grade               21611 non-null  int64
 11  sqft_above          21611 non-null  float64
 12  sqft_basement       21611 non-null  int64
 13  yr_built            21611 non-null  int64
 14  zipcode             21611 non-null  int64
 15  lat                 21611 non-null  float64
 16  long                21611 non-null  float64
 17  sqft_living15       21611 non-null  int64
 18  sqft_lot15          21611 non-null  int64
 19  city                21611 non-null  object
 20  state               21611 non-null  object
 21  county              21611 non-null  object
 22  population          21611 non-null  int64
 23  population_density  21611 non-null  int32
 24  tr_year             21611 non-null  int64
 25  tr_month            21611 non-null  int64
dtypes: float64(5), int32(2), int64(15), object(4)
memory usage: 4.3+ MB
Print the DataFrame info after the updates above.
# Take copy from data
df_copy = data.copy(deep=True)
This takes a deep copy of data so that later transformations do not alter the original DataFrame.
# Calculate the percentage of 0 values in df_copy relative to the total number of rows in data
round((df_copy[df_copy == 0].count()/data.shape[0])*100)
date                   0.00000
price                  0.00000
bedrooms               0.00000
bathrooms              0.00000
sqft_living            0.00000
sqft_lot               0.00000
floors                 0.00000
waterfront            99.00000
view                  90.00000
condition              0.00000
grade                  0.00000
sqft_above             0.00000
sqft_basement         61.00000
yr_built               0.00000
zipcode                0.00000
lat                    0.00000
long                   0.00000
sqft_living15          0.00000
sqft_lot15             0.00000
city                   0.00000
state                  0.00000
county                 0.00000
population             0.00000
population_density     0.00000
tr_year                0.00000
tr_month               0.00000
dtype: float64
This code calculates the percentage of zero values in a Pandas DataFrame df_copy relative to the total number of rows in another DataFrame data.
The first part of the code df_copy[df_copy == 0].count() creates a boolean DataFrame where True values indicate the presence of a zero value in df_copy, and then counts the number of True values for each column.
The second part of the code data.shape[0] gets the total number of rows in the original DataFrame data.
The result of the above calculation is then multiplied by 100 to get the percentage of zero values in df_copy relative to the total number of rows in data.
The round() function is used to round the result to the nearest whole number.
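An equivalent, slightly more direct idiom uses the fact that the mean of a boolean column is the fraction of True values:

```python
import pandas as pd

df = pd.DataFrame({"waterfront": [0, 0, 1, 0],
                   "price": [1, 2, 3, 4]})

# (df == 0) is a boolean frame; its column-wise mean is the fraction of
# zeros, so multiplying by 100 gives the percentage directly.
pct_zero = (df == 0).mean() * 100
print(pct_zero["waterfront"])  # 75.0
```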
# Drop the waterfront column from df_copy
df_copy.drop('waterfront',axis=1,inplace = True )
# Print the dataset
df_copy
| date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | ... | long | sqft_living15 | sqft_lot15 | city | state | county | population | population_density | tr_year | tr_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-10 | 221900 | 3 | 1.00000 | 1180 | 5650 | 1.00000 | 0 | 3 | 7 | ... | -122.25700 | 1340 | 5650 | Seattle | WA | King County | 24092 | 4966 | 2014 | 10 |
| 1 | 2014-12 | 538000 | 3 | 2.25000 | 2570 | 7242 | 2.00000 | 0 | 3 | 7 | ... | -122.31900 | 1690 | 7639 | Seattle | WA | King County | 37081 | 6879 | 2014 | 12 |
| 2 | 2015-02 | 180000 | 2 | 1.00000 | 770 | 10000 | 1.00000 | 0 | 3 | 6 | ... | -122.23300 | 2720 | 8062 | Kenmore | WA | King County | 20419 | 3606 | 2015 | 2 |
| 3 | 2014-12 | 604000 | 4 | 3.00000 | 1960 | 5000 | 1.00000 | 0 | 5 | 7 | ... | -122.39300 | 1360 | 5000 | Seattle | WA | King County | 14770 | 6425 | 2014 | 12 |
| 4 | 2015-02 | 510000 | 3 | 2.00000 | 1680 | 8080 | 1.00000 | 0 | 3 | 8 | ... | -122.04500 | 1800 | 7503 | Sammamish | WA | King County | 25748 | 2411 | 2015 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 2014-05 | 360000 | 3 | 2.50000 | 1530 | 1131 | 3.00000 | 0 | 3 | 8 | ... | -122.34600 | 1530 | 1509 | Seattle | WA | King County | 45911 | 9905 | 2014 | 5 |
| 21609 | 2015-02 | 400000 | 4 | 2.50000 | 2310 | 5813 | 2.00000 | 0 | 3 | 8 | ... | -122.36200 | 1830 | 7200 | Seattle | WA | King County | 25922 | 5573 | 2015 | 2 |
| 21610 | 2014-06 | 402101 | 2 | 0.75000 | 1020 | 1350 | 2.00000 | 0 | 3 | 7 | ... | -122.29900 | 1020 | 2007 | Seattle | WA | King County | 26881 | 7895 | 2014 | 6 |
| 21611 | 2015-01 | 400000 | 3 | 2.50000 | 1600 | 2388 | 2.00000 | 0 | 3 | 8 | ... | -122.06900 | 1410 | 1287 | Issaquah | WA | King County | 26141 | 469 | 2015 | 1 |
| 21612 | 2014-10 | 325000 | 2 | 0.75000 | 1020 | 1076 | 2.00000 | 0 | 3 | 7 | ... | -122.29900 | 1020 | 1357 | Seattle | WA | King County | 26881 | 7895 | 2014 | 10 |
21611 rows × 25 columns
This code drops the column named 'waterfront' from the Pandas DataFrame df_copy using the drop() method. The axis=1 parameter specifies that the column should be dropped, and the inplace=True parameter specifies that the changes should be made to df_copy directly.
After dropping the column, the updated df_copy DataFrame is printed to the console.
# Define a function to handle outliers in a DataFrame
def handling_outliers(df, display=False, drop=False, drop_order=1, columns_to_drop=[]):
# Get a list of numerical columns in the DataFrame
numerical_columns = list((df.select_dtypes(include=np.number)).columns)
# If display is True, plot boxplots for each numerical column
if display:
x = math.ceil(len(numerical_columns)/3)
plt.figure(figsize=(15, 25))
plt.subplots_adjust(hspace=0.5)
plt.suptitle("Outliers Detection")
for i in numerical_columns:
y = numerical_columns.index(i) + 1
ax = plt.subplot(x, 3, y)
ax = sns.boxplot(x=df[i], data=df)
ax.set_title(i)
# If drop is True, remove outliers from the DataFrame
if drop == True:
# If columns_to_drop is not empty, use those columns
if (len(columns_to_drop) != 0):
numerical_columns = columns_to_drop
# If drop_order is less than 1, set it to 1
elif drop_order < 1:
drop_order = 1
# Remove outliers drop_order times using the interquartile range (IQR) method
while drop_order != 0:
for i in numerical_columns:
q1 = df[i].quantile(0.25)
q3 = df[i].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5*iqr
if lower < 0:
lower = 0
higher = q3 + 1.5*iqr
df = df[df[i] >= lower]
df = df[df[i] <= higher]
drop_order = drop_order - 1
# Return the updated DataFrame
return df
this code defines a function handling_outliers that takes a DataFrame df and some optional arguments to either display or remove outliers from the DataFrame. If display is True, the function plots boxplots for each numerical column in the DataFrame. If drop is True, the function removes outliers from the DataFrame using the interquartile range (IQR) method. The drop_order argument specifies how many times to remove outliers, and the columns_to_drop argument allows the user to specify which columns to remove outliers from. Finally, the function returns the updated DataFrame.
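The core of the IQR method used above can be reduced to a few lines. This standalone sketch keeps only the rows whose value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for one column (illustrative data):

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

prices = pd.DataFrame({"price": [100, 110, 105, 95, 5000]})
# 5000 falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and is dropped.
print(iqr_filter(prices, "price")["price"].tolist())  # [100, 110, 105, 95]
```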
Before removing outliers from the specified columns:
# Call handling_outliers, passing the parameters and the columns whose outliers should be removed
df_1 = handling_outliers(df_copy , display= True , drop=True , drop_order=2 , columns_to_drop =['price','bedrooms','bathrooms','sqft_living','sqft_lot','sqft_basement','sqft_living15','sqft_lot15','grade'])
This code calls the handling_outliers function and passes several parameters to it:
df_copy is the Pandas DataFrame passed to the function for outlier handling.
display=True tells the function to plot boxplots for each numerical column.
drop=True tells the function to remove rows containing outliers from the DataFrame.
drop_order=2 tells the function to apply the IQR-based filter twice; a second pass can catch values that remain outliers once the first pass has tightened the quartiles.
columns_to_drop=['price','bedrooms','bathrooms','sqft_living','sqft_lot','sqft_basement','sqft_living15','sqft_lot15','grade'] restricts outlier removal to these columns; the columns themselves are not dropped from the DataFrame.
The function returns a new DataFrame df_1 with outlier rows removed from the specified columns, and the display=True parameter produces the boxplots.
# Call handling_outliers again to check whether the outliers were removed
df_1 = handling_outliers(df_1 , display= True )
After removing outliers from the specified columns:
This code calls handling_outliers again and passes two parameters to it:
df_1 is the DataFrame returned by the previous call.
display=True tells the function to plot boxplots for each numerical column.
Because drop defaults to False, no rows are removed this time; the boxplots simply let the user verify visually whether the previous call removed the outliers.
# Calculate the minimum and maximum prices for each city in df_copy
df_1 = [df_copy.groupby("city")["price"].min(), df_copy.groupby("city")["price"].max()]
df_1 = pd.DataFrame(df_1).round()
df_1.index = ['Min Price', 'Max Price']
df_1 = df_1.T
# Calculate the average price for each city in 2014 and 2015
avg_2014 = (df_copy[df_copy['tr_year'] == 2014]).groupby('city')['price'].mean()
avg_2015 = (df_copy[df_copy['tr_year'] == 2015]).groupby('city')['price'].mean()
# Combine the average prices for 2014 and 2015 into a single DataFrame
avg = pd.DataFrame({'Avg_price_2014': avg_2014, 'Avg_price_2015': avg_2015}).round()
# Merge the minimum and maximum prices and average prices into a single DataFrame
df_1 = pd.merge(df_1, avg, right_index=True, left_index=True)
# Return the DataFrame with the minimum, maximum, and average prices for each city
df_1
| Min Price | Max Price | Avg_price_2014 | Avg_price_2015 | |
|---|---|---|---|---|
| city | ||||
| Auburn | 90000 | 930000 | 290465.00000 | 293466.00000 |
| Bellevue | 247500 | 7062500 | 868641.00000 | 964244.00000 |
| Black Diamond | 135000 | 935000 | 423160.00000 | 424491.00000 |
| Bothell | 245500 | 1075000 | 484805.00000 | 505211.00000 |
| Carnation | 80000 | 1680000 | 457490.00000 | 450234.00000 |
| Duvall | 119500 | 1015000 | 425077.00000 | 424247.00000 |
| Enumclaw | 75000 | 858000 | 315381.00000 | 316340.00000 |
| Fall City | 142000 | 1862000 | 550451.00000 | 629035.00000 |
| Federal Way | 86500 | 1275000 | 288659.00000 | 290833.00000 |
| Issaquah | 130000 | 2700000 | 613217.00000 | 619490.00000 |
| Kenmore | 160000 | 1600000 | 454812.00000 | 476956.00000 |
| Kent | 85000 | 859000 | 295941.00000 | 305853.00000 |
| Kirkland | 90000 | 5110800 | 636751.00000 | 667882.00000 |
| Maple Valley | 110000 | 1350000 | 362377.00000 | 374816.00000 |
| Medina | 787500 | 6885000 | 2347732.00000 | 1628019.00000 |
| Mercer Island | 500000 | 5300000 | 1187996.00000 | 1208438.00000 |
| North Bend | 175000 | 1950000 | 424430.00000 | 484867.00000 |
| Redmond | 170000 | 2280000 | 656615.00000 | 664390.00000 |
| Renton | 95000 | 3000000 | 404593.00000 | 401266.00000 |
| Sammamish | 280000 | 3200000 | 727859.00000 | 744102.00000 |
| Seattle | 78000 | 7700000 | 532586.00000 | 540007.00000 |
| Snoqualmie | 170000 | 1998000 | 517542.00000 | 546641.00000 |
| Vashon | 160000 | 1379900 | 478821.00000 | 515311.00000 |
| Woodinville | 200000 | 1920000 | 613481.00000 | 625420.00000 |
In summary, this code calculates the minimum, maximum, and average prices for each city in a DataFrame df_copy that has a price column and a tr_year column. It first calculates the minimum and maximum prices for each city using the groupby() method and creates a DataFrame df_1 to store the results. It then calculates the average price for each city in 2014 and 2015 using the groupby() method and creates a DataFrame avg to store the results. Finally, it merges the minimum and maximum prices and average prices into a single DataFrame df_1 using the merge() method and returns the DataFrame.
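The same summary can be computed more compactly with agg() and pivot_table(). A minimal sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "city":    ["Seattle", "Seattle", "Kenmore"],
    "price":   [221900, 538000, 180000],
    "tr_year": [2014, 2015, 2015],
})

# One groupby/agg call produces min and max per city in a single table.
summary = df.groupby("city")["price"].agg(["min", "max"])
print(summary.loc["Seattle", "max"])  # 538000

# Yearly averages per city come from a pivot table.
avg = df.pivot_table(index="city", columns="tr_year",
                     values="price", aggfunc="mean")
print(avg.loc["Kenmore", 2015])  # 180000.0
```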
# Create a bar plot of the DataFrame df_1
df_1.plot(kind='bar', figsize=(20,10))
# Rotate the x-axis labels by 90 degrees
plt.xticks(rotation=90)
(array of 24 x-tick positions with the corresponding city-name labels, Auburn through Woodinville)
The overall code creates a bar plot of df_1 with rotated x-axis labels. The resulting plot shows the minimum, maximum, and average prices for each city in a visual format.
# Calculate the percentage change in average house prices from 2014 to 2015 for each city
df_1['change_percentage'] = round(((df_1['Avg_price_2015'] - df_1['Avg_price_2014'])/df_1['Avg_price_2014'])*100 , 2)
# Reset the index of the DataFrame
df_1 = df_1.reset_index()
# Create a line plot showing the average house prices for each city in 2014 and 2015
# Create the figure with a specified size
plt.figure(figsize=(12,8))
# Plot the average price per city in 2014
plt.plot(df_1['city'], df_1['Avg_price_2014'], label='2014')
# Plot the average price per city in 2015
plt.plot(df_1['city'], df_1['Avg_price_2015'], label='2015')
# Rotate the x-axis labels by 90 degrees
plt.xticks(rotation=90)
plt.legend()
# Label the x-axis 'City'
plt.xlabel('City')
# Label the y-axis 'Average House Price'
plt.ylabel('Average House Price')
# Set the title to 'Change in Average House Prices from 2014 to 2015'
plt.title('Change in Average House Prices from 2014 to 2015')
# Display the plot
plt.show()
The overall code calculates the percentage change in average house prices from 2014 to 2015 for each city, creates a line plot showing the average house prices for each city in 2014 and 2015, and adds labels and a title to the plot. The resulting plot shows the change in average house prices from 2014 to 2015 for each city in a visual format.
# Calculate the percentage change in average house prices from 2014 to 2015 for each city
df_1['change_percentage'] = round(((df_1['Avg_price_2015'] - df_1['Avg_price_2014'])/df_1['Avg_price_2014'])*100 , 2)
# Create a new figure with a specified size
plt.figure(figsize=(24,40))
# Adjust the spacing between subplots
plt.subplots_adjust(hspace=.5, wspace=0.1)
# Loop through each row of the DataFrame and create a horizontal bar plot for each city
for i in df_1.index:
# Get the values for the horizontal bar plot and the percentage change
value = [df_1['Avg_price_2014'][i], df_1['Avg_price_2015'][i]]
p = df_1['change_percentage'][i]
# Determine whether the percentage change is positive or negative and set the arrow direction and text accordingly
if p > 0:
a = 'Increased'
t = '<-'
else:
a = 'Decreased'
t = '->'
# Determine the number of rows and columns for the subplots and create a new subplot
x = math.ceil(df_1.shape[0]/2)
plt.subplot(x, 2, i+1)
# Create a horizontal bar plot with the values and colors for each year
ax = plt.barh(['Avg_price_2014', 'Avg_price_2015'], value, color=['tab:gray', 'tab:blue'])
ax = plt.gcf().gca()
# Annotate the percentage change with an arrow and text
ax.annotate('{} by {}%'.format(a, p),
xy=(0, 'Avg_price_2015'),
textcoords='axes fraction',
xytext=(0.8, 0.788),
arrowprops=dict(facecolor='orange', lw=6, arrowstyle=t),
horizontalalignment='right')
# Set the title of the subplot to the city name
ax.set_title(df_1['city'][i])
This code calculates the percentage change in average house prices from 2014 to 2015 for each city in the Pandas DataFrame df_1, and then creates a horizontal bar plot for each city that visualizes the change in prices.
The first line of code creates a new column in df_1 called 'change_percentage' that calculates the percentage change in average house prices from 2014 to 2015.
The second line of code creates a new figure with a specified size using the figure() function from the Matplotlib library. The figsize parameter specifies the width and height of the figure in inches.
The third line of code adjusts the spacing between subplots using the subplots_adjust() function from Matplotlib. The hspace and wspace parameters control the vertical and horizontal spacing between subplots, respectively.
The fourth line of code initiates a loop that iterates over each row in the DataFrame df_1 using the index attribute of the DataFrame.
The fifth line of code gets the values for the horizontal bar plot and the percentage change for the current city. The value variable is a list containing the average house prices for 2014 and 2015, and the p variable is the percentage change in prices for the current city.
The sixth line of code determines whether the percentage change is positive or negative and sets the arrow direction and text accordingly. If the percentage change is positive, the arrow direction is set to point left (indicating an increase) and the text is set to 'Increased'. Otherwise, the arrow direction is set to point right (indicating a decrease) and the text is set to 'Decreased'.
The seventh line of code determines the number of rows and columns for the subplots and creates a new subplot using the subplot() function from Matplotlib. The ceil() function from the math module is used to round up the number of rows to the nearest integer.
The eighth line of code creates a horizontal bar plot using the barh() function from Matplotlib. The barh() function creates a horizontal bar plot where the first argument is a list of y-values and the second argument is a list of corresponding x-values. In this case, the y-values are the strings 'Avg_price_2014' and 'Avg_price_2015', and the x-values are the average house prices for 2014 and 2015. The color parameter specifies the colors of the bars, with 'tab:gray' representing the color for 2014 and 'tab:blue' representing the color for 2015. The resulting ax variable contains the axis object for the current subplot.
The ninth line of code annotates the percentage change with an arrow and text using the annotate() function from Matplotlib. The annotate() function adds the annotation to the plot and takes several arguments. The xy parameter specifies the location of the arrow, which is set to (0, 'Avg_price_2015') to indicate that the arrow starts at the left side of the plot and points towards the 'Avg_price_2015' bar. The textcoords parameter specifies the coordinate system for the text, which is set to 'axes fraction' to indicate that the text position is relative to the axis. The xytext parameter specifies the location of the text, which is set to (0.8, 0.788) to position the text to the right of the arrow. The arrowprops parameter controls the appearance of the arrow, including its color, thickness, and style, and is set to an orange face color with a thickness of 6 and an arrow style determined by the t variable. Finally, the horizontalalignment parameter specifies the horizontal alignment of the text relative to the arrow.
The tenth line of code sets the title of the subplot to the city name using the set_title() method of the axis object. The city name is obtained from the 'city' column of the DataFrame df_1.
Overall, this code calculates the percentage change in average house prices from 2014 to 2015 for each city in df_1, and then creates a horizontal bar plot for each city that visualizes the change in prices in a clear and concise manner. The annotations and arrow directions make it easy to quickly interpret the direction and magnitude of the price changes. The code also uses various functions and methods from the Matplotlib and math libraries to create, customize, and adjust the subplots and visualizations.
# Print descriptive statistics of 'price'
df_copy['price'].describe().round()
count      21611.00000
mean      540085.00000
std       367143.00000
min        75000.00000
25%       321725.00000
50%       450000.00000
75%       645000.00000
max      7700000.00000
Name: price, dtype: float64
The first part of the code accesses the 'price' column of the DataFrame df_copy using the square bracket notation and passes it as an argument to the describe() method.
The describe() method calculates and returns a DataFrame with summary statistics for the 'price' column. These statistics include the count of non-null values, the mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value.
The round() function is used to round the summary statistics to the nearest integer. This is achieved by chaining the round() function to the end of the describe() method call using the dot notation.
# Create a deep copy of the DataFrame df_copy
df_3 = df_copy.copy(deep=True)
# Create a new column in the DataFrame df_3 that categorizes the prices of the houses
df_3['cat_price'] = pd.cut(x=df_copy['price'], bins=[0,230000,450000,900000,df_copy['price'].max()],
labels=['Affordable', 'Mid-Priced', 'Expensive', 'Luxury'])
The overall code creates a new DataFrame df_3 that is a deep copy of the existing DataFrame df_copy and adds a new column to df_3 that categorizes the prices of the houses into four categories: "Affordable", "Mid-Priced", "Expensive", and "Luxury".
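A small standalone example of how pd.cut assigns these labels; note that pd.cut uses half-open (lower, upper] intervals by default, so a house priced exactly 230000 would fall into 'Affordable':

```python
import pandas as pd

prices = pd.Series([150000, 300000, 700000, 2000000])

# Each value lands in the bin whose (lower, upper] interval contains it.
cats = pd.cut(prices,
              bins=[0, 230000, 450000, 900000, prices.max()],
              labels=["Affordable", "Mid-Priced", "Expensive", "Luxury"])
print(cats.tolist())  # ['Affordable', 'Mid-Priced', 'Expensive', 'Luxury']
```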
# Create a new figure with a specified size
plt.figure(figsize=(24,10))
# Create a countplot of the price categories in the DataFrame df_3 using Seaborn
sns.countplot(x='cat_price', data=df_3);
The overall code creates a countplot showing the frequency of each price category in the DataFrame df_3. The resulting plot provides a visual summary of the distribution of house prices in df_3.
# Define a list of the price categories
cat = ['Affordable', 'Mid-Priced', 'Expensive', 'Luxury']
# Loop through each price category and create a line plot of the average house prices over time
for i in cat:
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Get the average house prices over time for the current price category
x = (df_3[df_3['cat_price'] == i]).groupby('date')['price'].mean()
# Add labels and annotations to the plot
plt.xlabel("Date")
plt.ylabel("Price")
plt.axhline(y=x.mean(), color='r', linewidth=10, label='Average')
plt.title(i)
# Plot the average house prices over time
plt.plot(x, color='gray', label='Price', linewidth=5)
# Add grid lines and a legend to the plot
plt.grid(color='black', linestyle='--', linewidth=0.2)
plt.legend()
The overall code creates a set of line plots showing the average house prices over time for each price category in the DataFrame df_3. Each plot includes a horizontal line at the average house price for the corresponding price category and a legend indicating the average and actual house prices over time. The resulting plots provide a visual summary of how house prices have varied over time for different price categories.
# Get the total house prices over time for the years 2014 and 2015
x = (df_copy[df_copy['tr_year'] == 2014]).groupby('date')['price'].sum()
y = (df_copy[df_copy['tr_year'] == 2015]).groupby('date')['price'].sum()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Add labels and annotations to the plot
# Label the x-axis 'Date'
plt.xlabel("Date")
# Label the y-axis 'Price'
plt.ylabel("Price")
# Plot each year's total prices with its own color and line width
plt.plot(x, color='green', label='T_Price_2014', linewidth=10)
plt.plot(y, color='yellow', label='T_Price_2015', linewidth=10)
# Add a grid with the given color, line style, and width
plt.grid(color='black', linestyle='--', linewidth=0.2)
plt.legend()
<matplotlib.legend.Legend at 0x1be1242c310>
The overall code creates a line plot showing the total house prices over time for two different years in the DataFrame df_copy. The resulting plot provides a visual summary of how the total house prices have varied over time between the years 2014 and 2015.
# Create a deep copy of the DataFrame df_copy
df_avg = df_copy.copy(deep=True)
# Calculate the average price per square foot for living area and lot area
df_avg['avg_living_15'] = (df_copy['price'] / df_copy['sqft_living15']).round(2)
df_avg['avg_lot_15'] = (df_copy['price'] / df_copy['sqft_lot15']).round(2)
# Calculate the overall average price per square foot
df_avg['price_sqfr_avg'] = ((df_avg['avg_living_15'] + df_avg['avg_lot_15'])/2).round(2)
# Get the average price per square foot over time
x = df_avg.groupby('date')['price_sqfr_avg'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Add labels and annotations to the plot
plt.xlabel("Date")
plt.ylabel("Average Price per Sqft")
plt.grid(color='green', linestyle='--', linewidth=0.2)
plt.plot(x, color='r', label='Price per Sqft', linewidth=10);
The overall code creates a line plot showing the average price per square foot over time in the DataFrame df_avg. The resulting plot provides a visual summary of how the average price per square foot has varied over time, which can be useful for understanding trends in the housing market.
# Display a linear model plot of price vs. lot square footage
sns.lmplot(x='sqft_lot15', y='price', data=df_avg)
The plot shows a weak relationship between lot size and price.
# Group the DataFrame df_avg by city and get the mean price per square foot
city_sqft = df_avg.groupby('city')['price_sqfr_avg'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Extract the x and y values for the bar plot
x = city_sqft.index.to_list()
y = city_sqft.to_list()
# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)
# Add labels and annotations to the plot
plt.xticks(rotation=45)
plt.xlabel("City")
plt.ylabel("Average Price per Sqft")
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')
The overall code creates a bar plot showing the average price per square foot for each city in the DataFrame df_avg. The resulting plot provides a visual summary of how the average price per square foot varies between cities, which can be useful for understanding regional differences in the housing market.
# Create a new figure with a specified size
plt.figure(figsize=(12, 10))
# Create a histogram using Seaborn
sns.histplot(df_copy['yr_built'], color='red')
# Add grid lines to the plot
plt.grid(color='black', linestyle='--', linewidth=0.5)
The overall code creates a histogram showing the distribution of the yr_built variable in the DataFrame df_copy. The resulting plot provides a visual summary of when the houses in the dataset were built and how many houses were built in each year.
# Display a linear model plot of price vs. year built
sns.lmplot(x='yr_built', y='price', data=df_copy)
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Create two overlapping histograms of yr_built using Seaborn.
# Note: as written, both calls plot the same full yr_built column, so the
# renovation labels are misleading; the intent was presumably to filter the
# data (e.g. on yr_renovated) before plotting each subset.
sns.histplot(data=df_copy, x='yr_built', color='b', label="Houses that renovated by living")
sns.histplot(data=df_copy, x='yr_built', color='r', label="Houses that renovated by lot")
# Add grid lines to the plot
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')
# Add a legend to the plot
plt.legend()
The overall code creates two overlapping histograms of the yr_built variable in the DataFrame df_copy, labeled as renovation subsets. Because both calls plot the same column over the full DataFrame, the two distributions coincide; to genuinely compare renovation patterns, each histplot call would need to operate on a filtered subset of the houses.
# Create a deep copy of the DataFrame to avoid modifying the original data
df_age = df_copy.copy(deep=True)
# Categorize the yr_built variable into different age groups using the cut() function from Pandas
df_age['age'] = pd.cut(x=df_age['yr_built'], bins=[0, 1939, 1949, 1959, 1969, 1979, 1989, 1999, 2009, df_age['yr_built'].max()],
labels=['1939 Or Earlier', '1940s', '1950s', '1960s', '1970s', '1980s', '1990s', '2000s', '2010 Or Later'])
# Group the DataFrame by age and calculate the mean price for each age group
df_new = df_age.groupby('age')['price'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()
# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)
# Add a title, labels, and annotations to the plot
plt.suptitle("House Age vs Average Price")
plt.xticks(rotation=45)
plt.xlabel("Age Group")
plt.ylabel("Average Price")
plt.grid(color='red', linestyle='--', linewidth=0.5, axis='y')
The overall code creates a bar plot showing the average price for each age group in the DataFrame df_age. The resulting plot provides a visual summary of how the average price varies for houses built in different time periods, which can be useful for understanding the relationship between house age and price in the housing market.
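The binning step relies on `pd.cut` with right-closed intervals, so each bin edge belongs to the bin on its left. A minimal sketch with hypothetical build years (not taken from the dataset) shows how the edges map to decade labels:

```python
import pandas as pd

# Hypothetical build years, one per decade of interest
years = pd.Series([1935, 1955, 1972, 2003, 2015])

# Same edge/label pattern as above; intervals are (edge_i, edge_{i+1}]
age = pd.cut(years,
             bins=[0, 1939, 1949, 1959, 1969, 1979, 1989, 1999, 2009, 2015],
             labels=['1939 Or Earlier', '1940s', '1950s', '1960s',
                     '1970s', '1980s', '1990s', '2000s', '2010 Or Later'])
print(age.tolist())
# ['1939 Or Earlier', '1950s', '1970s', '2000s', '2010 Or Later']
```

Because the intervals are right-closed, a house built exactly in 1939 falls into the '1939 Or Earlier' bin rather than the '1940s'.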
# Calculate the number of rows in the DataFrame for each year
y = [df_copy[df_copy['tr_year'] == 2014].shape[0], df_copy[df_copy['tr_year'] == 2015].shape[0]]
# Specify labels for the pie chart
mylabels = ["2014", "2015"]
# Specify an offset for the first slice of the pie chart
myexplode = [0.09, 0]
# Create a new figure with a specified size
plt.figure(figsize=(12, 8))
# Create a pie chart using Matplotlib
plt.pie(y, labels=mylabels, explode=myexplode, autopct='%1.1f%%', shadow=True, colors=['c', 'y'])
# Display the plot
plt.show()
The overall code creates a pie chart showing the distribution of the tr_year variable in the DataFrame df_copy for the years 2014 and 2015. The resulting plot provides a visual summary of the relative frequency of the two years in the dataset, which can be useful for understanding the temporal distribution of the data.
# Group the DataFrame by cat_price and tr_month, and calculate the value counts for each group
months = df_3.groupby('cat_price')['tr_month'].value_counts().unstack()
# Extract the x and y values for the pie chart
x = months.sum().index.to_list()
y = months.sum().to_list()
# Create a new figure with a specified size
plt.figure(figsize=(12, 8))
# Create a pie chart using Matplotlib
plt.pie(y, labels=x, autopct='%1.1f%%', startangle=0)
# Display the plot
plt.show()
The overall code creates a pie chart showing the distribution of the tr_month variable in the DataFrame df_3 across all price categories. The resulting plot provides a visual summary of the relative frequency of each month, which can be useful for understanding the temporal distribution of the data.
# Create two new DataFrames based on whether the house has a basement or not.
# Use .copy() so the column assignments below do not trigger a SettingWithCopyWarning.
df_base = df_copy[df_copy['sqft_basement'] != 0].copy()
df_no_base = df_copy[df_copy['sqft_basement'] == 0].copy()
# Calculate the ratio of basement square footage to total living square footage for houses with basements
df_base['base_sqft_living'] = (df_base['sqft_basement'] / df_base['sqft_living']).round(2)
# Calculate the basement's share of the price; note the price term cancels,
# so this is algebraically identical to base_sqft_living
df_base['total_Pbase_price'] = ((df_base['price'] * df_base['base_sqft_living'])/df_base['price']).round(2)
# Calculate the mean percentage-based price for houses with basements and convert it to a percentage rounded to two decimal places
x = round(df_base['total_Pbase_price'].mean()*100 , 2)
The overall code creates a new DataFrame called df_base that includes only the rows from df_copy where the sqft_basement column is not zero, calculates the ratio of sqft_basement to sqft_living for those houses, derives a percentage-based price for each, and takes the mean, storing the result in the variable x. Because the price term cancels in the percentage calculation, x is simply the average basement-to-living-area ratio expressed as a percentage. This analysis can still be useful for gauging how much of a house's living space, and by this proxy its value, the basement represents.
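The cancellation is easy to verify on made-up numbers (none of these values come from the dataset): (price × ratio) / price always returns the ratio itself.

```python
import pandas as pd

# Toy rows with invented prices and square footages
toy = pd.DataFrame({'price': [300000.0, 500000.0],
                    'sqft_basement': [400.0, 750.0],
                    'sqft_living': [1600.0, 2500.0]})

toy['base_sqft_living'] = (toy['sqft_basement'] / toy['sqft_living']).round(2)
# The same formula as in the notebook: the price factor cancels out
toy['total_Pbase_price'] = ((toy['price'] * toy['base_sqft_living']) / toy['price']).round(2)

# The two derived columns are identical row by row
print(toy[['base_sqft_living', 'total_Pbase_price']])
```

A simpler equivalent would be to average `base_sqft_living` directly and multiply by 100.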
# Calculate the mean percentage-based price for houses with basements and convert it to a percentage rounded to two decimal places
x = round(df_base['total_Pbase_price'].mean()*100 , 2)
# Specify labels for the pie chart
mylabels = ["Basement Representation of Total price", ""]
# Specify an offset for the first slice of the pie chart
myexplode = [0.09, 0]
# Create a new figure with a specified size
plt.figure(figsize=(12, 8))
# Create a pie chart using Matplotlib
plt.pie([x, 100-x], labels=mylabels, explode=myexplode, autopct='%1.1f%%', shadow=True, colors=['c', 'y'])
# Display the plot
plt.show()
The overall code creates a pie chart showing the percentage of the total house price represented by the basement for houses with basements in the DataFrame df_base. The resulting plot provides a visual summary of the relative value of the basement compared to the rest of the house, which can be useful for understanding the basement's contribution to the overall value.
# Count the occurrences of each grade
df_copy['grade'].value_counts()
7: 8980, 8: 6067, 9: 2615, 6: 2038, 10: 1134, 11: 399, 5: 242, 12: 90, 4: 29, 13: 13, 3: 3, 1: 1 (Name: grade, dtype: int64)
# Create a new DataFrame that includes only the rows where the grade is less than 11
df_grade = df_copy[df_copy['grade'] < 11]
# Group the resulting DataFrame by grade and calculate the mean price for each group
df_grade = df_grade.groupby('grade')['price'].mean()
The overall code creates a new DataFrame called df_grade that includes only the rows from df_copy where the grade column is less than 11, and calculates the mean price for each grade level. This analysis can be useful for understanding the relationship between grade and price for lower-grade houses.
# Create a new DataFrame that includes only the rows where the grade is less than 11
df_grade = df_copy[df_copy['grade'] < 11]
# Group the resulting DataFrame by grade and calculate the mean price for each group
df_grade = df_grade.groupby('grade')['price'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Extract the x and y values for the bar plot
x = df_grade.index.to_list()
y = df_grade.to_list()
# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)
# Add a title to the plot
plt.suptitle("Grade vs Average Price")
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)
# Add a grid to the plot
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')
The overall code creates a bar plot that shows the relationship between the grade variable and the average price for houses with a grade less than 11 in the DataFrame df_copy. The resulting plot can be useful for understanding the relationship between grade and price for lower-grade houses and for identifying patterns or trends in the data.
# Count the number of houses for each bedroom count
df_copy['bedrooms'].value_counts()
3: 9823, 4: 6881, 2: 2760, 5: 1601, 6: 272, 1: 199, 7: 38, 0: 13, 8: 13, 9: 6, 10: 3, 11: 1, 33: 1 (Name: bedrooms, dtype: int64)
This code provides a quick and convenient way to obtain the count of properties in df_copy with each number of bedrooms, which can be useful for understanding the distribution of properties across bedroom counts and for identifying outliers or anomalies, such as the single 33-bedroom listing.
# Select rows where bedrooms <= 8
df_new = df_copy[df_copy['bedrooms'] <= 8]
This code creates a new DataFrame called df_new that includes only the rows from a DataFrame called df_copy where the value in the 'bedrooms' column is less than or equal to 8.
The code accesses the 'bedrooms' column of the DataFrame df_copy using square bracket notation.
The code then creates a Boolean mask by applying the comparison operator <= to the 'bedrooms' column and the integer value 8. This comparison operator returns a Boolean value of True or False for each row in the column, depending on whether the value in that row is less than or equal to 8.
The Boolean mask is then used to select only the rows from df_copy where the value in the 'bedrooms' column is less than or equal to 8, creating a new DataFrame called df_new.
Overall, this code provides a way to filter a DataFrame to include only the rows that meet a certain condition, in this case, when the number of bedrooms is less than or equal to 8. This can be useful for cleaning and preparing data for analysis by removing any outliers or invalid data points.
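The masking mechanics can be sketched on a toy frame with hypothetical values, showing the Boolean Series that the comparison produces and the rows it selects:

```python
import pandas as pd

# Invented example rows, including an outlier like the dataset's 33-bedroom listing
toy = pd.DataFrame({'bedrooms': [2, 9, 3, 33],
                    'price': [180000, 900000, 250000, 640000]})

mask = toy['bedrooms'] <= 8   # Boolean Series: True where the condition holds
filtered = toy[mask]          # keeps only the rows where mask is True

print(mask.tolist())                  # [True, False, True, False]
print(filtered['bedrooms'].tolist())  # [2, 3]
```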
# Group the DataFrame by bedrooms and calculate the mean price for each group
df_new = df_new.groupby('bedrooms')['price'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()
# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)
# Add a title to the plot
plt.suptitle("Bedrooms vs Average Price")
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)
# Add a grid to the plot
plt.grid(color='red', linestyle='--', linewidth=0.8, axis='y')
The overall code creates a bar plot that shows the relationship between the bedrooms variable and the average price in the DataFrame df_new. The resulting plot can be useful for understanding the relationship between bedroom count and price, and for identifying patterns or trends in the data.
# Count the number of houses for each bathroom count
df_copy['bathrooms'].value_counts()
2.5: 5379, 1.0: 3851, 1.75: 3048, 2.25: 2047, 2.0: 1930, 1.5: 1446, 2.75: 1185, 3.0: 753, 3.5: 731, 3.25: 589, 3.75: 155, 4.0: 136, 4.5: 100, 4.25: 79, 0.75: 72, 4.75: 23, 5.0: 21, 5.25: 13, 0.0: 10, 5.5: 10, 1.25: 9, 6.0: 6, 0.5: 4, 5.75: 4, 6.75: 2, 8.0: 2, 6.25: 2, 6.5: 2, 7.5: 1, 7.75: 1 (Name: bathrooms, dtype: int64)
This code creates a frequency table of the 'bathrooms' column in a Pandas DataFrame called df_copy.
The first part of the code accesses the 'bathrooms' column of the DataFrame df_copy using square bracket notation.
The value_counts() method is called on the 'bathrooms' column to count the number of occurrences of each unique value in the column and returns a Pandas Series where the unique values are the index and the corresponding counts are the values.
The resulting frequency table is a Pandas Series object that is printed to the console, showing the number of times each unique value in the 'bathrooms' column appears in the DataFrame.
Overall, this code provides a quick and convenient way to obtain the count of the number of properties in df_copy with a certain number of bathrooms, which can be useful for understanding the distribution of properties across different numbers of bathrooms and for identifying any outliers or anomalies in the data.
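A minimal illustration with invented values shows the behavior described above: `value_counts()` returns a Series indexed by unique value, sorted by descending frequency:

```python
import pandas as pd

# Hypothetical bathroom counts, not taken from the dataset
s = pd.Series([2.5, 1.0, 2.5, 1.75, 2.5, 1.0])
counts = s.value_counts()
print(counts)
# 2.50    3
# 1.00    2
# 1.75    1
```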
# Filter the original DataFrame to include only rows with at least one bathroom
df_new = df_copy[df_copy['bathrooms'] >= 1]
# Filter the resulting DataFrame to exclude rows with more than 4 bathrooms
df_new = df_new[df_new['bathrooms'] < 5]
The overall code filters the original DataFrame to include only rows with at least one bathroom, then further filters the result to exclude rows with more than 4 bathrooms. This can be useful for creating a DataFrame focused on houses with a typical number of bathrooms, which is relevant for some types of analyses.
# Group the DataFrame by bathrooms and calculate the mean price for each group
df_new = df_new.groupby('bathrooms')['price'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()
# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)
# Add a title to the plot
plt.suptitle("bathrooms vs Average Price")
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)
# Add a grid to the plot
plt.grid(color='red', linestyle='--', linewidth=0.8, axis='y')
The overall code creates a bar plot that shows the relationship between the bathrooms variable and the average price in the DataFrame df_new. The resulting plot can be useful for understanding the relationship between bathroom count and price, and for identifying patterns or trends in the data.
# Group the DataFrame by floors and calculate the mean price for each group
df_new = df_copy.groupby('floors')['price'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()
# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)
# Add a title to the plot
plt.suptitle("floors vs Average Price")
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)
# Add a grid to the plot
plt.grid(color='red', linestyle='--', linewidth=0.8, axis='y')
The overall code creates a bar plot that shows the relationship between the floors variable and the average price in the DataFrame df_copy. The resulting plot can be useful for understanding the relationship between the number of floors and price, and for identifying patterns or trends in the data.
# Group the DataFrame by condition and calculate the mean price for each group
df_new = df_copy.groupby('condition')['price'].mean()
# Create a new figure with a specified size
plt.figure(figsize=(24, 10))
# Extract the x and y values for the bar plot
x = df_new.index.to_list()
y = df_new.to_list()
# Create a bar plot using Seaborn
sns.barplot(x=x, y=y)
# Add a title to the plot
plt.suptitle("condition vs Average Price")
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)
# Add a grid to the plot
plt.grid(color='black', linestyle='--', linewidth=0.8, axis='y')
The overall code creates a bar plot that shows the relationship between the condition variable and the average price in the DataFrame df_copy. The resulting plot can be useful for understanding the relationship between condition and price, and for identifying patterns or trends in the data.
# Create a new figure with a specified size
fig = plt.figure(figsize=(24, 70))
# Extract the column names of the DataFrame, excluding 'price'
y = df_copy.columns.to_list()
y.remove('price')
# Adjust the spacing between the subplots (subplots_adjust returns None, so no assignment is needed)
plt.subplots_adjust(hspace=0.7)
# Loop over each variable in the DataFrame
for i in y:
# Group the DataFrame by the current variable and calculate the mean price for each group
x = df_copy.groupby(i)['price'].mean()
x = pd.DataFrame(x)
# Calculate the index of the current variable in the list of column names
a = y.index(i)
# Create a new subplot with the appropriate title, labels, and color
ax1 = plt.subplot(15, 2, a+1)
ax1.set_ylabel("Average Price")
ax1.set_xlabel(i)
ax1.set_title("Average Price Vs {}".format(i))
if y.index(i) % 2 != 0:
col = 'red'
else:
col = 'blue'
# Plot the mean price for each group as a line plot
ax1.plot(x.index, x.price, color=col, label='Price', linewidth=10)
The overall code creates a grid of subplots that show the relationship between each variable in the DataFrame df_copy (except price) and the average price. The resulting grid can be useful for understanding the relationship between each variable and price, and for identifying patterns or trends in the data.
# Drop the 'date' column
df_copy.drop('date', axis=1, inplace=True)
# Display the DataFrame
df_copy
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | sqft_above | ... | long | sqft_living15 | sqft_lot15 | city | state | county | population | population_density | tr_year | tr_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900 | 3 | 1.00000 | 1180 | 5650 | 1.00000 | 0 | 3 | 7 | 1180.00000 | ... | -122.25700 | 1340 | 5650 | Seattle | WA | King County | 24092 | 4966 | 2014 | 10 |
| 1 | 538000 | 3 | 2.25000 | 2570 | 7242 | 2.00000 | 0 | 3 | 7 | 2170.00000 | ... | -122.31900 | 1690 | 7639 | Seattle | WA | King County | 37081 | 6879 | 2014 | 12 |
| 2 | 180000 | 2 | 1.00000 | 770 | 10000 | 1.00000 | 0 | 3 | 6 | 770.00000 | ... | -122.23300 | 2720 | 8062 | Kenmore | WA | King County | 20419 | 3606 | 2015 | 2 |
| 3 | 604000 | 4 | 3.00000 | 1960 | 5000 | 1.00000 | 0 | 5 | 7 | 1050.00000 | ... | -122.39300 | 1360 | 5000 | Seattle | WA | King County | 14770 | 6425 | 2014 | 12 |
| 4 | 510000 | 3 | 2.00000 | 1680 | 8080 | 1.00000 | 0 | 3 | 8 | 1680.00000 | ... | -122.04500 | 1800 | 7503 | Sammamish | WA | King County | 25748 | 2411 | 2015 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 360000 | 3 | 2.50000 | 1530 | 1131 | 3.00000 | 0 | 3 | 8 | 1530.00000 | ... | -122.34600 | 1530 | 1509 | Seattle | WA | King County | 45911 | 9905 | 2014 | 5 |
| 21609 | 400000 | 4 | 2.50000 | 2310 | 5813 | 2.00000 | 0 | 3 | 8 | 2310.00000 | ... | -122.36200 | 1830 | 7200 | Seattle | WA | King County | 25922 | 5573 | 2015 | 2 |
| 21610 | 402101 | 2 | 0.75000 | 1020 | 1350 | 2.00000 | 0 | 3 | 7 | 1020.00000 | ... | -122.29900 | 1020 | 2007 | Seattle | WA | King County | 26881 | 7895 | 2014 | 6 |
| 21611 | 400000 | 3 | 2.50000 | 1600 | 2388 | 2.00000 | 0 | 3 | 8 | 1600.00000 | ... | -122.06900 | 1410 | 1287 | Issaquah | WA | King County | 26141 | 469 | 2015 | 1 |
| 21612 | 325000 | 2 | 0.75000 | 1020 | 1076 | 2.00000 | 0 | 3 | 7 | 1020.00000 | ... | -122.29900 | 1020 | 1357 | Seattle | WA | King County | 26881 | 7895 | 2014 | 10 |
21611 rows × 24 columns
This code drops the 'date' column from a Pandas DataFrame called df_copy and then displays the resulting DataFrame.
The drop() method is called on df_copy to remove the 'date' column. The axis=1 parameter specifies that a column (rather than a row) should be dropped, and the inplace=True parameter specifies that the operation should be performed in place on the original DataFrame rather than returning a new one.
Because df_copy is the last expression in the cell, the notebook displays the DataFrame with the 'date' column removed.
Overall, this code provides a way to remove a column from a DataFrame, which can be useful for cleaning and preparing data for analysis by removing any irrelevant or redundant columns.
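An equivalent, arguably more readable form names the column directly via the `columns=` keyword; a toy sketch with invented rows:

```python
import pandas as pd

toy = pd.DataFrame({'date': ['2014-10-13', '2014-12-09'],
                    'price': [221900, 538000]})

# Same effect as toy.drop('date', axis=1, inplace=True)
toy.drop(columns='date', inplace=True)
print(toy.columns.tolist())  # ['price']
```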
# Create a label encoder object
le = LabelEncoder()
# Apply label encoding to the 'city' column in the dataframe 'df_copy'
df_copy['city_M'] = le.fit_transform(df_copy['city'])
# The label encoded values will be stored in a new column called 'city_M'
# Output the encoded dataframe
df_copy
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | sqft_above | ... | sqft_living15 | sqft_lot15 | city | state | county | population | population_density | tr_year | tr_month | city_M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900 | 3 | 1.00000 | 1180 | 5650 | 1.00000 | 0 | 3 | 7 | 1180.00000 | ... | 1340 | 5650 | Seattle | WA | King County | 24092 | 4966 | 2014 | 10 | 20 |
| 1 | 538000 | 3 | 2.25000 | 2570 | 7242 | 2.00000 | 0 | 3 | 7 | 2170.00000 | ... | 1690 | 7639 | Seattle | WA | King County | 37081 | 6879 | 2014 | 12 | 20 |
| 2 | 180000 | 2 | 1.00000 | 770 | 10000 | 1.00000 | 0 | 3 | 6 | 770.00000 | ... | 2720 | 8062 | Kenmore | WA | King County | 20419 | 3606 | 2015 | 2 | 10 |
| 3 | 604000 | 4 | 3.00000 | 1960 | 5000 | 1.00000 | 0 | 5 | 7 | 1050.00000 | ... | 1360 | 5000 | Seattle | WA | King County | 14770 | 6425 | 2014 | 12 | 20 |
| 4 | 510000 | 3 | 2.00000 | 1680 | 8080 | 1.00000 | 0 | 3 | 8 | 1680.00000 | ... | 1800 | 7503 | Sammamish | WA | King County | 25748 | 2411 | 2015 | 2 | 19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 360000 | 3 | 2.50000 | 1530 | 1131 | 3.00000 | 0 | 3 | 8 | 1530.00000 | ... | 1530 | 1509 | Seattle | WA | King County | 45911 | 9905 | 2014 | 5 | 20 |
| 21609 | 400000 | 4 | 2.50000 | 2310 | 5813 | 2.00000 | 0 | 3 | 8 | 2310.00000 | ... | 1830 | 7200 | Seattle | WA | King County | 25922 | 5573 | 2015 | 2 | 20 |
| 21610 | 402101 | 2 | 0.75000 | 1020 | 1350 | 2.00000 | 0 | 3 | 7 | 1020.00000 | ... | 1020 | 2007 | Seattle | WA | King County | 26881 | 7895 | 2014 | 6 | 20 |
| 21611 | 400000 | 3 | 2.50000 | 1600 | 2388 | 2.00000 | 0 | 3 | 8 | 1600.00000 | ... | 1410 | 1287 | Issaquah | WA | King County | 26141 | 469 | 2015 | 1 | 9 |
| 21612 | 325000 | 2 | 0.75000 | 1020 | 1076 | 2.00000 | 0 | 3 | 7 | 1020.00000 | ... | 1020 | 1357 | Seattle | WA | King County | 26881 | 7895 | 2014 | 10 | 20 |
21611 rows × 25 columns
This code provides a way to encode categorical data into numerical values, which can be useful for machine learning algorithms that require numerical input. The label encoder assigns a unique integer to each category, with integers assigned in sorted (alphabetical) order of the category labels rather than by frequency.
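A quick sketch confirms the ordering: `LabelEncoder` derives its codes from the sorted unique labels, so frequency plays no role (the city names here are just illustrative examples):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Seattle', 'Kenmore', 'Seattle', 'Issaquah'])

print(list(le.classes_))  # ['Issaquah', 'Kenmore', 'Seattle'] -- sorted, not by count
print(list(codes))        # [2, 1, 2, 0]
```

Note that the resulting integers imply an arbitrary ordering between cities; for linear models, one-hot encoding is often a safer choice than label encoding for nominal categories.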
# Create a label encoder object
le = LabelEncoder()
# Apply label encoding to the 'state' column in the dataframe 'df_copy'
df_copy['state_M'] = le.fit_transform(df_copy['state'])
# The label encoded values will be stored in a new column called 'state_M'
# Output the encoded dataframe
df_copy
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | sqft_above | ... | sqft_lot15 | city | state | county | population | population_density | tr_year | tr_month | city_M | state_M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900 | 3 | 1.00000 | 1180 | 5650 | 1.00000 | 0 | 3 | 7 | 1180.00000 | ... | 5650 | Seattle | WA | King County | 24092 | 4966 | 2014 | 10 | 20 | 0 |
| 1 | 538000 | 3 | 2.25000 | 2570 | 7242 | 2.00000 | 0 | 3 | 7 | 2170.00000 | ... | 7639 | Seattle | WA | King County | 37081 | 6879 | 2014 | 12 | 20 | 0 |
| 2 | 180000 | 2 | 1.00000 | 770 | 10000 | 1.00000 | 0 | 3 | 6 | 770.00000 | ... | 8062 | Kenmore | WA | King County | 20419 | 3606 | 2015 | 2 | 10 | 0 |
| 3 | 604000 | 4 | 3.00000 | 1960 | 5000 | 1.00000 | 0 | 5 | 7 | 1050.00000 | ... | 5000 | Seattle | WA | King County | 14770 | 6425 | 2014 | 12 | 20 | 0 |
| 4 | 510000 | 3 | 2.00000 | 1680 | 8080 | 1.00000 | 0 | 3 | 8 | 1680.00000 | ... | 7503 | Sammamish | WA | King County | 25748 | 2411 | 2015 | 2 | 19 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 360000 | 3 | 2.50000 | 1530 | 1131 | 3.00000 | 0 | 3 | 8 | 1530.00000 | ... | 1509 | Seattle | WA | King County | 45911 | 9905 | 2014 | 5 | 20 | 0 |
| 21609 | 400000 | 4 | 2.50000 | 2310 | 5813 | 2.00000 | 0 | 3 | 8 | 2310.00000 | ... | 7200 | Seattle | WA | King County | 25922 | 5573 | 2015 | 2 | 20 | 0 |
| 21610 | 402101 | 2 | 0.75000 | 1020 | 1350 | 2.00000 | 0 | 3 | 7 | 1020.00000 | ... | 2007 | Seattle | WA | King County | 26881 | 7895 | 2014 | 6 | 20 | 0 |
| 21611 | 400000 | 3 | 2.50000 | 1600 | 2388 | 2.00000 | 0 | 3 | 8 | 1600.00000 | ... | 1287 | Issaquah | WA | King County | 26141 | 469 | 2015 | 1 | 9 | 0 |
| 21612 | 325000 | 2 | 0.75000 | 1020 | 1076 | 2.00000 | 0 | 3 | 7 | 1020.00000 | ... | 1357 | Seattle | WA | King County | 26881 | 7895 | 2014 | 10 | 20 | 0 |
21611 rows × 26 columns
This code performs label encoding on the values in the 'state' column of a Pandas DataFrame called df_copy.
The first line of code creates a new LabelEncoder object called le. A label encoder is a preprocessing technique that assigns a unique integer value to each category in a categorical feature.
The second line of code applies label encoding to the 'state' column in df_copy by calling the fit_transform() method of the le object on the 'state' column. The fit_transform() method fits the label encoder to the 'state' column and then transforms the category labels into numerical values.
The resulting numerical values are stored in a new column called 'state_M' in df_copy, the label-encoded counterpart of the 'state' column.
Because df_copy is the last expression in the cell, the notebook displays the encoded DataFrame.
Overall, this code provides a way to encode categorical data into numerical values, which can be useful for machine learning algorithms that require numerical input. The label encoder assigns a unique integer to each category, in sorted (alphabetical) order of the labels. Since every row in this dataset has state 'WA', the 'state_M' column is 0 everywhere and carries no predictive information.
# Create a new feature called 'age' by subtracting the 'yr_built' column from the current year
df_copy['age'] = 2023 - df_copy['yr_built']
# Create a feature called 'total_sqft' that represents the total square footage of the property
df_copy['total_sqft'] = df_copy['sqft_living'] + df_copy['sqft_lot']
# Standardize the numerical features using the StandardScaler
scaler = StandardScaler()
# Choice Columns
num_cols = ['population','bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'age', 'total_sqft']
df_copy[num_cols] = scaler.fit_transform(df_copy[num_cols])
# Print the first few rows of the transformed DataFrame to inspect the data
df_copy.head()
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | sqft_above | ... | state | county | population | population_density | tr_year | tr_month | city_M | state_M | age | total_sqft | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900 | -0.39871 | -1.44752 | -0.97980 | -0.22833 | -0.91544 | -0.30577 | -0.62916 | -0.55883 | -0.73468 | ... | WA | King County | -0.59278 | 4966 | 2014 | 10 | 20 | 0 | 0.54501 | -0.24904 |
| 1 | 538000 | -0.39871 | 0.17556 | 0.53370 | -0.18989 | 0.93644 | -0.30577 | -0.62916 | -0.55883 | 0.46081 | ... | WA | King County | 0.57437 | 6879 | 2014 | 12 | 20 | 0 | 0.68120 | -0.17734 |
| 2 | 180000 | -1.47390 | -1.44752 | -1.42623 | -0.12331 | -0.91544 | -0.30577 | -0.62916 | -1.40955 | -1.22979 | ... | WA | King County | -0.92283 | 3606 | 2015 | 2 | 10 | 0 | 1.29403 | -0.15431 |
| 3 | 604000 | 0.67648 | 1.14941 | -0.13050 | -0.24402 | -0.91544 | -0.30577 | 2.44426 | -0.55883 | -0.89167 | ... | WA | King County | -1.43043 | 6425 | 2014 | 12 | 20 | 0 | 0.20455 | -0.24591 |
| 4 | 510000 | -0.39871 | -0.14905 | -0.43538 | -0.16966 | -0.91544 | -0.30577 | -0.62916 | 0.29189 | -0.13090 | ... | WA | King County | -0.44398 | 2411 | 2015 | 2 | 19 | 0 | -0.54447 | -0.17859 |
5 rows × 28 columns
This code performs several data preprocessing steps on a Pandas DataFrame called df_copy.
The first line of code creates a new feature called 'age' by subtracting the 'yr_built' column from the current year (2023) and storing the result in a new 'age' column in the DataFrame. This calculates the age of each property in years.
The second line of code creates a new feature called 'total_sqft' that represents the total square footage of the property by summing the 'sqft_living' and 'sqft_lot' columns and storing the result in a new 'total_sqft' column in the DataFrame.
The third line of code creates a StandardScaler object called 'scaler' that will be used to standardize the numerical features in the DataFrame.
The fourth line of code specifies a list of column names called 'num_cols' that contains the names of the numerical features in the DataFrame that should be standardized.
The fifth line of code applies the fit_transform() method of the 'scaler' object to the columns specified in 'num_cols'. This method fits the scaler to the data and then transforms the data to have mean 0 and standard deviation 1.
The resulting standardized numerical features are stored back in the 'num_cols' columns of the DataFrame.
Finally, the last line of code displays the first few rows of the transformed DataFrame using the head() method to inspect the data.
Overall, this code performs several common data preprocessing steps, including creating new features, standardizing numerical features, and printing the resulting DataFrame. These steps are important for preparing data for analysis by machine learning algorithms, as they can help to improve the accuracy and interpretability of the results.
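StandardScaler's transform is the familiar z-score, z = (x − mean) / std, computed with the population standard deviation (ddof=0). A hand calculation on toy numbers reproduces the effect: the scaled values have mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical sqft_living values, not taken from the dataset
x = np.array([1180.0, 2570.0, 770.0, 1960.0, 1680.0])

# np.std defaults to ddof=0, which is what StandardScaler uses internally
z = (x - x.mean()) / x.std()

print(round(z.mean(), 10), round(z.std(), 10))  # ~0.0 and 1.0
```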
# Create a figure and axes with a larger figure size
fig,ax=plt.subplots(figsize=(15,10))
# Create a heatmap of the correlation matrix (numeric_only=True skips string columns such as state and county)
sns.heatmap(df_copy.corr(numeric_only=True),annot=True,cmap='RdYlGn',fmt='.2f')
<Axes: >
This code provides a convenient way to visualize the correlation between variables in a DataFrame using a heatmap, which can be useful for identifying patterns and relationships in the data. The correlation coefficients range from -1 to 1, with values closer to -1 indicating a negative correlation (inverse relationship), values closer to 1 indicating a positive correlation (direct relationship), and values closer to 0 indicating no correlation.
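As a follow-up to the heatmap, the correlations with price can also be ranked directly. A sketch, with a toy frame standing in for df_copy:

```python
import pandas as pd

# Toy numeric frame standing in for df_copy
df = pd.DataFrame({
    'price':       [221900, 538000, 180000, 604000, 510000],
    'sqft_living': [1180, 2570, 770, 1960, 1680],
    'bedrooms':    [3, 3, 2, 4, 3],
})

# Correlation of every numeric column with price, strongest first
corr_with_price = df.corr()['price'].drop('price').sort_values(ascending=False)
print(corr_with_price)
```

This one-liner is often a quicker guide for feature selection than scanning the full heatmap.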
# Choose the features to include in the model
model_ft=['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
'lat','sqft_lot15','population']
# Print the selected features
print('We will use these Features to build the model : '+str(model_ft))
# Print the number of features
print('Number of features: '+str(len(model_ft)))
We will use these Features to build the model : ['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'view', 'grade', 'sqft_above', 'sqft_basement', 'lat', 'sqft_lot15', 'population'] Number of features: 11
The model_ft variable is a list that contains the names of the features that will be used to build the model. These features were chosen based on their potential importance in predicting the target variable (the house price, defined below).
The first print() statement outputs a message to the console that displays the list of features that will be used to build the model. The str() function is used to convert the list to a string before concatenating it with the rest of the message.
The second print() statement outputs a message to the console that displays the number of features that will be used to build the model, which is the length of the model_ft list.
Overall, this code provides a way to select a subset of features from a DataFrame for use in a machine learning model, which can help to improve the accuracy and interpretability of the model by reducing the number of irrelevant or redundant features.
# Identify columns with NaN values in X
cols_with_nan = df_copy.columns[df_copy.isna().any()].tolist()
# Drop rows with NaN values from X and y
df_copy.dropna(subset=cols_with_nan, inplace=True)
# create a variable from dataset 'Feature'
X = df_copy[model_ft]
# create a variable from dataset 'Target'
y = df_copy['price']
This code performs data preprocessing tasks to prepare the dataset for machine learning modeling. First, it identifies the columns in the dataset that contain missing values using the isna() method and stores their names in the cols_with_nan list. It then drops all the rows in the dataset that contain missing values in any of the columns specified in cols_with_nan using the dropna() method, which modifies the original dataframe. The resulting dataset is then split into features and target variables using the model_ft list and the 'price' column, respectively. These variables are then used for machine learning modeling.
# Split the data into training and testing sets
# Train split size == 80%
# Test split size == 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This code splits the dataset into an 80% training set and a 20% test set, with random_state=42 to make the split reproducible.
# create an instance of the Logistic Regression class
lor = LogisticRegression()
# fit the logistic regression model to the scaled training data
lor.fit(X_train, y_train)
# use the trained model to make predictions on the scaled testing data
lor_pred = lor.predict(X_test)
# calculate the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) scores for the predictions
lor_mse = mean_squared_error(y_test, lor_pred)
lor_rmse = mean_squared_error(y_test, lor_pred, squared=False)
lor_r2 = r2_score(y_test, lor_pred)
# print the results
print('Logistic Regression MSE: {:.2f}'.format(lor_mse))
print('Logistic Regression RMSE: {:.2f}'.format(lor_rmse))
print('Logistic Regression R2: {:.2f}'.format(lor_r2))
Logistic Regression MSE: 152306019.15 Logistic Regression RMSE: 12341.23 Logistic Regression R2: 1.00
This code trains and evaluates a logistic regression model using the scikit-learn library. Note, however, that LogisticRegression is a classifier: when it is fit to a continuous target such as price, scikit-learn treats every distinct price as a separate class. The regression metrics computed below are therefore not meaningful for this model, and the near-perfect R2 score should not be taken at face value.
An instance of the LogisticRegression() class is created and assigned to the variable lor.
The fit() method is used to train the model on the training set, X_train and y_train.
The predict() method is then used to generate predictions, lor_pred, on the test set, X_test.
The mean_squared_error() function (with squared=False for the root) and the r2_score() function are used to compute the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) scores for the predictions. These scores are printed to the console using the print() function.
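Because LogisticRegression is a classifier, fitting it to a continuous target makes scikit-learn treat every distinct value as its own class, so its predictions can only ever be values seen during training. A minimal sketch illustrating this, on synthetic data rather than the housing set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "prices" drawn from only 4 distinct values
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = rng.integers(1, 5, size=60) * 100_000

# The classifier memorizes the distinct target values as class labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
preds = clf.predict(X)

# Every prediction is one of the exact values seen in training
print(sorted(set(preds)))
```

This is why a classifier can appear to achieve very low error on a price column with many repeated values while having no genuine regression ability.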
# create a scatter plot of the actual vs. predicted values for the logistic regression model
plt.scatter(y_test, lor_pred)
# add labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Logistic Regression')
# display the plot
plt.show()
This code creates a scatter plot to visualize the performance of a logistic regression model.
The scatter() function from the matplotlib library is used to create a scatter plot of the actual values of the target variable, y_test, against the predicted values of the target variable, lor_pred.
The xlabel(), ylabel(), and title() functions are used to add appropriate axis labels and a title to the plot.
The resulting plot allows for a visual evaluation of the performance of the logistic regression model, and can be used to identify any patterns or trends in the model's predictions.
An analysis of the scatter plot can be used to identify any areas where the model may be over- or under-predicting the target variable, and can guide future improvements to the model.
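One simple aid for reading such plots, an optional addition not in the original code, is a y = x reference line: points on that line correspond to perfect predictions, and the spread around it shows the error. A sketch with stand-in values:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Stand-ins for y_test and the model predictions
y_test = np.array([200_000, 350_000, 500_000, 650_000])
pred   = np.array([230_000, 330_000, 520_000, 600_000])

plt.scatter(y_test, pred)
# Reference line from the smallest to the largest value on either axis
lims = [min(y_test.min(), pred.min()), max(y_test.max(), pred.max())]
plt.plot(lims, lims, 'r--', label='Perfect prediction (y = x)')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.legend()
plt.savefig('actual_vs_pred.png')
```

The same overlay can be added after any of the plt.scatter() calls in this notebook.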
# create an instance of the LinearRegression class
lr = LinearRegression()
# fit the linear regression model to the scaled training data
lr.fit(X_train, y_train)
# use the trained model to make predictions on the scaled testing data
lr_pred = lr.predict(X_test)
# calculate the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) scores for the predictions
lr_mse = mean_squared_error(y_test, lr_pred)
lr_rmse = mean_squared_error(y_test, lr_pred, squared=False)
lr_r2 = r2_score(y_test, lr_pred)
# print the results
print('Linear Regression MSE: {:.2f}'.format(lr_mse))
print('Linear Regression RMSE: {:.2f}'.format(lr_rmse))
print('Linear Regression R2: {:.2f}'.format(lr_r2))
Linear Regression MSE: 52838556699.08 Linear Regression RMSE: 229866.39 Linear Regression R2: 0.65
This code trains a linear regression model on training data and evaluates its performance on test data by computing three metrics: mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2). The LinearRegression class is used to create the model instance, which is then fit to the training data using the fit() method. Predictions are made on the test data using the predict() method, and the metrics are computed using the appropriate functions. Finally, the metrics are printed to the console using the print() function.
# create a scatter plot of the actual vs. predicted values for the linear regression model
plt.scatter(y_test, lr_pred)
# add labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Linear Regression')
# display the plot
plt.show()
This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the linear regression model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.
# Instantiate a Gradient Boosting Regressor model with 100 estimators and a random state of 42
gb = GradientBoostingRegressor(n_estimators=100, random_state=42)
# Train the model on the scaled training set
gb.fit(X_train, y_train)
# Use the trained model to predict on the scaled test set
gb_pred = gb.predict(X_test)
# Compute the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) metrics
gb_mse = mean_squared_error(y_test, gb_pred)
gb_rmse = mean_squared_error(y_test, gb_pred, squared=False)
gb_r2 = r2_score(y_test, gb_pred)
# Print the computed metrics
print('Gradient Boosting MSE: {:.2f}'.format(gb_mse))
print('Gradient Boosting RMSE: {:.2f}'.format(gb_rmse))
print('Gradient Boosting R2: {:.2f}'.format(gb_r2))
Gradient Boosting MSE: 25071928541.03 Gradient Boosting RMSE: 158341.18 Gradient Boosting R2: 0.83
This code trains a Gradient Boosting Regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the GradientBoostingRegressor class with 100 estimators and a random state of 42. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance.
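Tree ensembles such as GradientBoostingRegressor also expose a feature_importances_ attribute, which can help interpret which features drive the predictions. A sketch on synthetic data (the column names here are illustrative, not the notebook's fitted model):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in: the first feature drives the target, the second is noise
rng = np.random.default_rng(42)
X = pd.DataFrame({'sqft_living': rng.normal(size=200),
                  'noise': rng.normal(size=200)})
y = 3 * X['sqft_living'] + rng.normal(scale=0.1, size=200)

gb = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; a higher value means the feature was used more in splits
importances = pd.Series(gb.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Running the same two lines on the notebook's fitted gb model would rank the eleven selected features by their contribution.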
# Visualize the predicted vs. actual values using a scatter plot
plt.scatter(y_test, gb_pred)
# Add axis labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Gradient Boosting Regression')
# Display the plot
plt.show()
This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the Gradient Boosting Regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.
# Instantiate a neural network regressor model with two hidden layers of size 100 and 50, maximum iterations of 1000, and a random state of 42
nn = MLPRegressor(hidden_layer_sizes=(100,50), max_iter=1000, random_state=42)
# Train the model on the scaled training set
nn.fit(X_train, y_train)
# Use the trained model to predict on the scaled test set
nn_pred = nn.predict(X_test)
# Compute the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) metrics
nn_mse = mean_squared_error(y_test, nn_pred)
nn_rmse = mean_squared_error(y_test, nn_pred, squared=False)
nn_r2 = r2_score(y_test, nn_pred)
# Print the computed metrics
print('Neural Network MSE: {:.2f}'.format(nn_mse))
print('Neural Network RMSE: {:.2f}'.format(nn_rmse))
print('Neural Network R2: {:.2f}'.format(nn_r2))
Neural Network MSE: 33912810579.48 Neural Network RMSE: 184154.31 Neural Network R2: 0.77
This code trains a neural network regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the MLPRegressor class with two hidden layers of size 100 and 50, maximum iterations of 1000, and a random state of 42. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance.
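MLPRegressor is particularly sensitive to feature scale. One common pattern, shown here as a sketch rather than the author's code, is to wrap the scaler and the network in a Pipeline so that scaling is fit only on the training data inside every call to fit():

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# Synthetic data with wildly different feature scales
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(size=300), rng.normal(scale=10_000, size=300)])
y = X[:, 0] + X[:, 1] / 10_000

# The pipeline re-scales inside every fit, avoiding train/test leakage
nn = make_pipeline(StandardScaler(),
                   MLPRegressor(hidden_layer_sizes=(100, 50),
                                max_iter=1000, random_state=42))
nn.fit(X, y)
print(nn.score(X, y))
```

Without the scaler, the huge-magnitude feature would dominate the gradient updates and the network would converge far more slowly, if at all.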
# Visualize the predicted vs. actual values using a scatter plot
plt.scatter(y_test, nn_pred)
# Add axis labels and a title to the plot
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Neural Network Regression')
# Display the plot
plt.show()
This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the neural network regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.
# Instantiate a random forest regressor model with 100 estimators and a random state of 42
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the model on the scaled training set
rf.fit(X_train, y_train)
# Use the trained model to predict on the scaled test set
rf_pred = rf.predict(X_test)
# Compute the mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) metrics
rf_mse = mean_squared_error(y_test, rf_pred)
rf_rmse = mean_squared_error(y_test, rf_pred, squared=False)
rf_r2 = r2_score(y_test, rf_pred)
# Print the computed metrics
print('Random Forest MSE: {:.2f}'.format(rf_mse))
print('Random Forest RMSE: {:.2f}'.format(rf_rmse))
print('Random Forest R2: {:.2f}'.format(rf_r2))
Random Forest MSE: 22318190987.17 Random Forest RMSE: 149392.74 Random Forest R2: 0.85
This code trains a random forest regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the RandomForestRegressor class with 100 estimators and a random state of 42. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance.
# Visualize the predicted vs. actual values for the random forest regression model
plt.scatter(y_test, rf_pred)
# Set the x-axis label
plt.xlabel('Actual values')
# Set the y-axis label
plt.ylabel('Predicted values')
# Set the plot title
plt.title('Random Forest Regression')
# Show the plot
plt.show()
This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the random forest regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.
#This line initializes an instance of the ExtraTreesRegressor class with the specified hyperparameters
ETR = ExtraTreesRegressor(n_estimators=500 , n_jobs= -1 , max_depth=24 ,min_samples_split=8 , min_samples_leaf=9 )
# Fit the Extra Trees regression model to the scaled training set
ETR.fit(X_train, y_train)
# Predict the target values for the test set using the trained model
ETR_pred = ETR.predict(X_test)
# Compute and print the metrics for the model performance evaluation
ETR_mse = mean_squared_error(y_test, ETR_pred)
ETR_rmse = mean_squared_error(y_test, ETR_pred, squared=False)
ETR_r2 = r2_score(y_test, ETR_pred)
# Print the computed metrics
print('Extra Trees Regressor: {:.2f}'.format(ETR_mse))
print('Extra Trees Regression RMSE: {:.2f}'.format(ETR_rmse))
print('Extra Trees Regressor R2: {:.2f}'.format(ETR_r2))
Extra Trees Regressor: 32606447477.30 Extra Trees Regression RMSE: 180572.55 Extra Trees Regressor R2: 0.78
This code trains an Extra Trees Regressor model on the training set and evaluates its performance on the test set. The model is instantiated using the ExtraTreesRegressor class with the specified hyperparameters. The fit() method is used to train the model on the scaled training set, and the predict() method is used to generate predictions on the scaled test set. The code then computes three performance metrics (MSE, RMSE, and R2) using the appropriate functions from the scikit-learn library. Finally, the computed metrics are printed to the console using the print() function to evaluate the model's performance. The Extra Trees Regressor is a type of ensemble learning method that combines multiple decision trees to make more accurate predictions.
# Visualize the predicted vs. actual values for the Extra Trees Regression model
# Create a scatter plot with the actual values on the x-axis and the predicted values on the y-axis
plt.scatter(y_test,ETR_pred)
# Set the label for the x-axis
plt.xlabel('Actual values')
# Set the label for the y-axis
plt.ylabel('Predicted values')
# Set the title of the plot
plt.title('Extra Trees Regression')
# Display the plot
plt.show()
This code creates a scatter plot to visualize the relationship between the actual and predicted target values for the Extra Trees Regressor model. The scatter() function from the Matplotlib library is used to create the plot. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.
# Compute the score() for each model and visualize its predictions
# (score() returns R2 for the regressors and classification accuracy for LogisticRegression)
models = {'Logistic Regression': lor, 'Linear Regression': lr, 'Gradient Boosting': gb, 'Neural Network': nn, 'Random Forest': rf, 'Extra Trees Regressor': ETR}
for name, model in models.items():
    score = model.score(X_test, y_test)
    print('{} Accuracy Score: {:.2f}'.format(name, score))
    pred = model.predict(X_test)
    plt.scatter(y_test, pred)
    plt.xlabel('Actual Prices')
    plt.ylabel('Predicted Prices')
    plt.title(name)
    plt.show()
Logistic Regression Accuracy Score: 0.01
Linear Regression Accuracy Score: 0.65
Gradient Boosting Accuracy Score: 0.83
Neural Network Accuracy Score: 0.77
Random Forest Accuracy Score: 0.85
Extra Trees Regressor Accuracy Score: 0.78
This code computes a score and visualizes the performance for each of the six models used to predict house prices. A dictionary is created containing the name of each model along with its corresponding instance.
For each model, the score() method is called on the test set. For the five regressors this returns the R2 score, which measures how well the model explains the variance in the target variable; for the logistic regression classifier it returns classification accuracy, which explains its very low value of 0.01 here. The score is printed to the console using the print() function.
Next, the predict() method is called on the trained model with the test set as an argument to generate predictions. A scatter plot is created using the scatter() function to visualize the relationship between the actual and predicted target values. The x-axis corresponds to the actual values and the y-axis corresponds to the predicted values. The plot is labeled with appropriate axis labels and title using the xlabel(), ylabel(), and title() functions, and then displayed using the show() function.
This process is repeated for each of the six models, allowing for a comparison of their scores and visual performance. The scatter plots provide a visual representation of how well the models predict house prices, with a distribution of points clustered more tightly around the diagonal indicating better performance.
MSE measures the average squared difference between the predicted values and the actual values in a regression model. It is calculated by taking the sum of the squared differences between the predicted and actual values and dividing by the number of observations:

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

where y is the actual value, ŷ is the predicted value, and n is the number of observations.
MSE is useful for comparing different models as it penalizes large errors more than small errors. However, MSE has the disadvantage of being difficult to interpret as it is expressed in squared units.
RMSE is the square root of MSE and is therefore expressed in the same units as the dependent variable. RMSE is a popular metric for evaluating the accuracy of predictive models. It is calculated by taking the square root of the MSE.
RMSE is useful because it gives a meaningful interpretation of the magnitude of the prediction errors. However, like MSE, RMSE also does not take into account the variability of the data.
R-squared is a measure of how well the regression model fits the data. It is the proportion of the variance in the dependent variable that is explained by the independent variable(s). R-squared ranges from 0 to 1, with 1 indicating a perfect fit and 0 indicating no fit at all.
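The three metrics defined above can be verified by hand against scikit-learn. A worked example with toy numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# MSE = (1/n) * sum((y - y_hat)^2) = (0.25 + 0 + 1) / 3
mse = np.mean((y_true - y_pred) ** 2)
# RMSE is the square root of MSE, in the units of the target
rmse = np.sqrt(mse)
# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Hand computations agree with the library functions
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(mse, rmse, r2)
```

Here SS_tot is 8 (the variance of y_true around its mean of 5, times n), so R2 = 1 − 1.25/8 = 0.84375.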
# Define the models
models = [lor,lr, gb, nn, rf, ETR]
model_names = ['LOG','LR', 'GB', 'NN', 'RF', 'ETR']
# Create empty lists to store the evaluation metrics
mse_scores = []
rmse_scores = []
r2_scores = []
# Evaluate each model
for model in models:
    # Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mse_scores.append(mse)
    # Root Mean Squared Error (RMSE)
    rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
    rmse_scores.append(rmse)
    # R-Squared (R2) Score
    r2 = r2_score(y_test, model.predict(X_test))
    r2_scores.append(r2)
# Create a dataframe to store the evaluation metrics
evaluation_df = pd.DataFrame({'Model': model_names,
                              'MSE': mse_scores,
                              'RMSE': rmse_scores,
                              'R2': r2_scores})
# Print the evaluation metrics for each model
print(evaluation_df)
# Create a bar plot to compare the MSE scores of the models
plt.bar(model_names, mse_scores)
plt.title('Mean Squared Error')
plt.xlabel('Model')
plt.ylabel('MSE')
plt.show()
# Create a bar plot to compare the RMSE scores of the models
plt.bar(model_names, rmse_scores)
plt.title('Root Mean Squared Error')
plt.xlabel('Model')
plt.ylabel('RMSE')
plt.show()
# Create a bar plot to compare the R2 scores of the models
plt.bar(model_names, r2_scores)
plt.title('R-Squared Score')
plt.xlabel('Model')
plt.ylabel('R2')
plt.show()
# Print the best model for each evaluation metric
best_mse_model = evaluation_df.loc[evaluation_df['MSE'].idxmin(), 'Model']
best_rmse_model = evaluation_df.loc[evaluation_df['RMSE'].idxmin(), 'Model']
best_r2_model = evaluation_df.loc[evaluation_df['R2'].idxmax(), 'Model']
print('Best Model (MSE):', best_mse_model)
print('Best Model (RMSE):',best_rmse_model)
print('Best Model (r2):',best_r2_model)
| | Model | MSE | RMSE | R2 |
|---|---|---|---|---|
| 0 | LOG | 152306019.14851 | 12341.23248 | 0.99898 |
| 1 | LR | 52838556699.08271 | 229866.38880 | 0.64633 |
| 2 | GB | 25071928541.02862 | 158341.17765 | 0.83218 |
| 3 | NN | 33912810579.47690 | 184154.31187 | 0.77301 |
| 4 | RF | 22318190987.17298 | 149392.74074 | 0.85062 |
| 5 | ETR | 32606447477.29959 | 180572.55461 | 0.78175 |
Best Model (MSE): LOG Best Model (RMSE): LOG Best Model (r2): LOG
This code evaluates the performance of the six models on the task of predicting house prices, using three evaluation metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-Squared (R2) Score.
For each model, the code computes these three evaluation metrics and stores them in lists. These lists are then used to create a pandas DataFrame that summarizes the performance of each model.
The code also creates three bar plots to visualize the performance of each model for each evaluation metric. Finally, the code prints the best model for each metric based on the evaluation DataFrame.
Note that the logistic regression model ('LOG') appears best on every metric, but this is an artifact of misapplying a classifier to a continuous target: it can only predict exact prices seen during training, so its low error does not reflect genuine predictive ability. Among the true regressors, the random forest performs best on all three metrics.
#Create scatter plot for each model
plt.scatter(y_test, lor_pred, label='Logistic Regression')
plt.scatter(y_test, lr_pred, label='Linear Regression')
plt.scatter(y_test, gb_pred, label='Gradient Boosting')
plt.scatter(y_test, nn_pred, label='Neural Network')
plt.scatter(y_test, rf_pred, label='Random Forest')
plt.scatter(y_test, ETR_pred, label='Extra Trees Regressor')
#Set plot labels and title
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Comparison of Regression Models based on Actual vs. Predicted values')
#Add legend to the plot
plt.legend()
#Display the plot
plt.show()
This code creates a scatter plot to compare the performance of the six regression models used to predict house prices. The scatter() function from the Matplotlib library is used to create the plot, with each model's predicted values plotted against the actual values on the y- and x-axes, respectively.
The code then uses the label parameter to add a label to each model's scatter plot. The xlabel(), ylabel(), and title() functions are used to set appropriate axis labels and title for the plot.
Finally, the legend() function is used to add a legend to the plot that identifies each model's scatter plot. The legend helps to distinguish between the different models and their corresponding scatter plots. The resulting plot allows for a visual comparison of the performance of each model, providing insights into which models are better suited for the prediction task at hand.
# Take a copy of the data for classification
df_copy1=df_copy.copy(deep=True)
# Print the dataset to check the copy
df_copy1
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | sqft_above | ... | state | county | population | population_density | tr_year | tr_month | city_M | state_M | age | total_sqft | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900 | -0.39871 | -1.44752 | -0.97980 | -0.22833 | -0.91544 | -0.30577 | -0.62916 | -0.55883 | -0.73468 | ... | WA | King County | -0.59278 | 4966 | 2014 | 10 | 20 | 0 | 0.54501 | -0.24904 |
| 1 | 538000 | -0.39871 | 0.17556 | 0.53370 | -0.18989 | 0.93644 | -0.30577 | -0.62916 | -0.55883 | 0.46081 | ... | WA | King County | 0.57437 | 6879 | 2014 | 12 | 20 | 0 | 0.68120 | -0.17734 |
| 2 | 180000 | -1.47390 | -1.44752 | -1.42623 | -0.12331 | -0.91544 | -0.30577 | -0.62916 | -1.40955 | -1.22979 | ... | WA | King County | -0.92283 | 3606 | 2015 | 2 | 10 | 0 | 1.29403 | -0.15431 |
| 3 | 604000 | 0.67648 | 1.14941 | -0.13050 | -0.24402 | -0.91544 | -0.30577 | 2.44426 | -0.55883 | -0.89167 | ... | WA | King County | -1.43043 | 6425 | 2014 | 12 | 20 | 0 | 0.20455 | -0.24591 |
| 4 | 510000 | -0.39871 | -0.14905 | -0.43538 | -0.16966 | -0.91544 | -0.30577 | -0.62916 | 0.29189 | -0.13090 | ... | WA | King County | -0.44398 | 2411 | 2015 | 2 | 19 | 0 | -0.54447 | -0.17859 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | 360000 | -0.39871 | 0.50018 | -0.59871 | -0.33743 | 2.78832 | -0.30577 | -0.62916 | 0.29189 | -0.31203 | ... | WA | King County | 1.36782 | 9905 | 2014 | 5 | 20 | 0 | -1.29349 | -0.34928 |
| 21609 | 400000 | 0.67648 | 0.50018 | 0.25059 | -0.22439 | 0.93644 | -0.30577 | -0.62916 | 0.29189 | 0.62987 | ... | WA | King County | -0.42834 | 5573 | 2015 | 2 | 20 | 0 | -1.46372 | -0.21795 |
| 21610 | 402101 | -1.47390 | -1.77214 | -1.15402 | -0.33214 | 0.93644 | -0.30577 | -0.62916 | -0.55883 | -0.92789 | ... | WA | King County | -0.34217 | 7895 | 2014 | 6 | 20 | 0 | -1.29349 | -0.35628 |
| 21611 | 400000 | -0.39871 | 0.50018 | -0.52249 | -0.30708 | 0.93644 | -0.30577 | -0.62916 | 0.29189 | -0.22750 | ... | WA | King County | -0.40867 | 469 | 2015 | 1 | 9 | 0 | -1.12326 | -0.31737 |
| 21612 | 325000 | -1.47390 | -1.77214 | -1.15402 | -0.33876 | 0.93644 | -0.30577 | -0.62916 | -0.55883 | -0.92789 | ... | WA | King County | -0.34217 | 7895 | 2014 | 10 | 20 | 0 | -1.25945 | -0.36287 |
21611 rows × 28 columns
This code creates a new DataFrame that is a deep copy of an existing one, so that the data can be modified without affecting the original DataFrame. In this case, the copy is created to prepare the data for a classification task.
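A minimal illustration, on toy data, of why deep=True matters: changes made to the copy do not propagate back to the original frame.

```python
import pandas as pd

df = pd.DataFrame({'price': [221900, 538000]})
df_copy1 = df.copy(deep=True)

# Mutating the deep copy leaves the original untouched
df_copy1.loc[0, 'price'] = 0
print(df.loc[0, 'price'])       # original value is preserved
print(df_copy1.loc[0, 'price'])  # only the copy changed
```

This is exactly the property needed here, since the price column of the copy is about to be overwritten with categorical bins.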
#Define the quantile boundaries
#q = [0, 0.25, 0.5, 0.75, 1]
q=[0,0.33,0.66,1]
#Define the bin labels
labels = ['SalaryA', 'SalaryB', 'SalaryC']
#Perform binning on the 'price' column and store the result in a new column 'price'
df_copy1['price'] = pd.qcut(df_copy1['price'], q=q, labels=labels)
#Display the updated dataframe
df_copy1
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | view | condition | grade | sqft_above | ... | state | county | population | population_density | tr_year | tr_month | city_M | state_M | age | total_sqft | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SalaryA | -0.39871 | -1.44752 | -0.97980 | -0.22833 | -0.91544 | -0.30577 | -0.62916 | -0.55883 | -0.73468 | ... | WA | King County | -0.59278 | 4966 | 2014 | 10 | 20 | 0 | 0.54501 | -0.24904 |
| 1 | SalaryB | -0.39871 | 0.17556 | 0.53370 | -0.18989 | 0.93644 | -0.30577 | -0.62916 | -0.55883 | 0.46081 | ... | WA | King County | 0.57437 | 6879 | 2014 | 12 | 20 | 0 | 0.68120 | -0.17734 |
| 2 | SalaryA | -1.47390 | -1.44752 | -1.42623 | -0.12331 | -0.91544 | -0.30577 | -0.62916 | -1.40955 | -1.22979 | ... | WA | King County | -0.92283 | 3606 | 2015 | 2 | 10 | 0 | 1.29403 | -0.15431 |
| 3 | SalaryC | 0.67648 | 1.14941 | -0.13050 | -0.24402 | -0.91544 | -0.30577 | 2.44426 | -0.55883 | -0.89167 | ... | WA | King County | -1.43043 | 6425 | 2014 | 12 | 20 | 0 | 0.20455 | -0.24591 |
| 4 | SalaryB | -0.39871 | -0.14905 | -0.43538 | -0.16966 | -0.91544 | -0.30577 | -0.62916 | 0.29189 | -0.13090 | ... | WA | King County | -0.44398 | 2411 | 2015 | 2 | 19 | 0 | -0.54447 | -0.17859 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | SalaryA | -0.39871 | 0.50018 | -0.59871 | -0.33743 | 2.78832 | -0.30577 | -0.62916 | 0.29189 | -0.31203 | ... | WA | King County | 1.36782 | 9905 | 2014 | 5 | 20 | 0 | -1.29349 | -0.34928 |
| 21609 | SalaryB | 0.67648 | 0.50018 | 0.25059 | -0.22439 | 0.93644 | -0.30577 | -0.62916 | 0.29189 | 0.62987 | ... | WA | King County | -0.42834 | 5573 | 2015 | 2 | 20 | 0 | -1.46372 | -0.21795 |
| 21610 | SalaryB | -1.47390 | -1.77214 | -1.15402 | -0.33214 | 0.93644 | -0.30577 | -0.62916 | -0.55883 | -0.92789 | ... | WA | King County | -0.34217 | 7895 | 2014 | 6 | 20 | 0 | -1.29349 | -0.35628 |
| 21611 | SalaryB | -0.39871 | 0.50018 | -0.52249 | -0.30708 | 0.93644 | -0.30577 | -0.62916 | 0.29189 | -0.22750 | ... | WA | King County | -0.40867 | 469 | 2015 | 1 | 9 | 0 | -1.12326 | -0.31737 |
| 21612 | SalaryA | -1.47390 | -1.77214 | -1.15402 | -0.33876 | 0.93644 | -0.30577 | -0.62916 | -0.55883 | -0.92789 | ... | WA | King County | -0.34217 | 7895 | 2014 | 10 | 20 | 0 | -1.25945 | -0.36287 |
21611 rows × 28 columns
This code performs quantile-based binning on the price column of the dataframe df_copy1. The q parameter specifies the quantile boundaries used for binning (here the 33rd and 66th percentiles), and the labels parameter specifies the labels for the resulting bins.
The pd.qcut() function performs the binning operation on the price column, and the result is written back to the price column itself, replacing the continuous prices with categorical labels. qcut() creates bins by dividing the data into intervals with an approximately equal number of observations in each bin.
The updated dataframe displays the price column with the corresponding bin label for each observation, turning the prediction problem into a three-class classification task.
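The equal-frequency behavior of pd.qcut can be seen on a small example (toy values, not the housing prices): with nine observations and three quantile bins, each bin receives three observations.

```python
import pandas as pd

prices = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900])

# Three equal-frequency bins, cut at the 33rd and 66th percentiles
bins = pd.qcut(prices, q=[0, 0.33, 0.66, 1],
               labels=['SalaryA', 'SalaryB', 'SalaryC'])
print(bins.value_counts())
```

Contrast this with pd.cut(), which produces equal-width intervals and can leave some bins nearly empty when the data are skewed, as house prices typically are.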
# Choose the columns to feed into the model
model_ft=['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
'lat','sqft_lot15','population']
# Print the selected columns
print('We will use these Features to build the model : '+str(model_ft))
# Print the number of columns
print('Number of features: '+str(len(model_ft)))
We will use these Features to build the model : ['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'view', 'grade', 'sqft_above', 'sqft_basement', 'lat', 'sqft_lot15', 'population'] Number of features: 11
This code defines a list of features to be used in building a prediction model. The list contains the names of the features that are most relevant to predicting the target variable (i.e., house price).
The print() function is used to display the list of features and the number of features in the list. This information is important for understanding the model's input data and how it is being used to make predictions.
By selecting only the most relevant features, the model can reduce the dimensionality of the input data, leading to faster training times and potentially better model performance. Additionally, using fewer features can help to avoid overfitting and reduce the risk of the model making predictions based on noise or irrelevant data.
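One way to check that chosen features really carry signal is to inspect a tree ensemble's feature_importances_. The sketch below runs on synthetic data rather than df_copy1; the column names 'informative' and 'noise' are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = pd.DataFrame({'informative': rng.normal(size=500),
                  'noise': rng.normal(size=500)})
y = (X['informative'] > 0).astype(int)  # label depends only on one column

rfc = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(rfc.feature_importances_, index=X.columns)
# The informative column should dominate the pure-noise column
print(importances.sort_values(ascending=False))
```

In the notebook the same check would be run on the fitted model and X = df_copy1[model_ft], and low-ranking features could be candidates for removal.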
# Identify columns with NaN values
cols_with_nan = df_copy1.columns[df_copy1.isna().any()].tolist()
# Drop rows with NaN values in those columns
df_copy1.dropna(subset=cols_with_nan, inplace=True)
# define X (feature matrix)
X = df_copy1[model_ft]
# define y (target: price)
y = df_copy1['price']
This code identifies the columns in the dataframe df_copy1 that contain NaN (Not a Number) values using the isna() and any() functions, and stores them in a list called cols_with_nan.
The dropna() function then removes any rows with missing values in those columns from df_copy1, so that the model is trained on complete data.
Finally, X is assigned the feature columns listed in model_ft, and y is assigned the price column. This gives the model the relevant features (X) with which to predict the target variable (y).
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This code splits the data into training and testing sets using the train_test_split() function from the scikit-learn library. The X and y variables are split into separate training and testing sets, with 80% of the data used for training and 20% used for testing. The random_state parameter is set to 42 to ensure reproducibility of the results. This allows the model to be trained on a subset of the data and tested on the remaining data to evaluate its performance.
# create an instance of the LogisticRegression class
lor = LogisticRegression()
# fit the logistic regression model to the training data
lor.fit(X_train, y_train)
# use the trained model to make predictions on the test data
y_pred_LOG= lor.predict(X_test)
# compute the accuracy on the test set
accuracyLOG = accuracy_score(y_test, y_pred_LOG)
# Print the name of the model and its accuracy on the test data
print('Logistic Regression Accuracy: ',accuracyLOG*100)
Logistic Regression Accuracy: 71.96391394864678
This code trains a Logistic Regression Classifier on the training set, X_train and y_train, using the LogisticRegression() class from the scikit-learn library. The fit() method is used to train the model on the training set, and the predict() method is used to generate predictions, y_pred_LOG, on the test set, X_test.
The accuracy_score() function is then used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyLOG. Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format.
This allows for the evaluation of the Logistic Regression Classifier model's performance on the prediction task at hand.
# create confusion matrix
cm = confusion_matrix(y_test, y_pred_LOG)
# Plot a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Logistic Regression Confusion Matrix')
# display the plot
plt.show()
# Create confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred_LOG)
#create a classification report
cr = classification_report(y_test, y_pred_LOG)
print(cm)
print(cr)
[[1139 265 14]
[ 278 885 291]
[ 16 348 1087]]
precision recall f1-score support
SalaryA 0.79 0.80 0.80 1418
SalaryB 0.59 0.61 0.60 1454
SalaryC 0.78 0.75 0.76 1451
accuracy 0.72 4323
macro avg 0.72 0.72 0.72 4323
weighted avg 0.72 0.72 0.72 4323
This code creates a confusion matrix and classification report to evaluate the Logistic Regression Classifier on the test set. The confusion_matrix() function from the scikit-learn library computes the matrix from the true labels, y_test, and the predictions, y_pred_LOG, and the classification_report() function summarizes precision, recall, and F1-score for each salary class.
The diagonal of the matrix counts correct predictions for each class, while the off-diagonal entries show where one class is confused for another. Here the middle class, SalaryB, is the hardest to separate, with the lowest precision and recall.
These outputs allow for the evaluation of the Logistic Regression Classifier's performance on the prediction task at hand.
# create the model
dtc = DecisionTreeClassifier()
# fit the decision tree model
dtc.fit(X_train, y_train)
# generate predictions on the test set
y_pred_DT=dtc.predict(X_test)
# compute the accuracy on the test set
accuracyDT = accuracy_score(y_test, y_pred_DT)
# Print the name of the model and its accuracy on the test data
print('Decision Tree Accuracy: ',accuracyDT*100)
Decision Tree Accuracy: 75.15614156835531
This code trains a Decision Tree Classifier model on the training set, X_train and y_train, using the DecisionTreeClassifier() function from the scikit-learn library. The fit() method is used to train the model on the training set, and the predict() method is used to generate predictions, y_pred_DT, on the test set, X_test.
The accuracy_score() function is then used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyDT. Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format.
This allows for the evaluation of the Decision Tree Classifier model's performance on the prediction task at hand.
# create confusion matrix
cm = confusion_matrix(y_test, y_pred_DT)
# Plot a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Decision Tree Confusion Matrix')
# display the plot
plt.show()
# Create confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred_DT)
#create a classification report
cr = classification_report(y_test, y_pred_DT)
print(cm)
print(cr)
[[1133 263 22]
[ 265 929 260]
[ 24 240 1187]]
precision recall f1-score support
SalaryA 0.80 0.80 0.80 1418
SalaryB 0.65 0.64 0.64 1454
SalaryC 0.81 0.82 0.81 1451
accuracy 0.75 4323
macro avg 0.75 0.75 0.75 4323
weighted avg 0.75 0.75 0.75 4323
This code creates a confusion matrix to evaluate the performance of a Decision Tree Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.
The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.
The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.
# create random forest model
rfc = RandomForestClassifier()
# fit the model
rfc.fit(X_train, y_train)
# generate predictions on the test set
y_pred_RF=rfc.predict(X_test)
# compute the accuracy on the test set
accuracyRF = accuracy_score(y_test, y_pred_RF)
# Print the name of the model and its accuracy on the test data
print('Random Forest Accuracy: ',accuracyRF*100)
Random Forest Accuracy: 81.72565348137867
# create a confusion matrix
cm = confusion_matrix(y_test, y_pred_RF)
# Plot confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Random Forest Confusion Matrix')
# display the plot
plt.show()
# Create confusion matrix classification report
cm = confusion_matrix(y_test, y_pred_RF)
# create a classification report
cr = classification_report(y_test, y_pred_RF)
print(cm)
print(cr)
[[1192 218 8]
[ 175 1087 192]
[ 5 192 1254]]
precision recall f1-score support
SalaryA 0.87 0.84 0.85 1418
SalaryB 0.73 0.75 0.74 1454
SalaryC 0.86 0.86 0.86 1451
accuracy 0.82 4323
macro avg 0.82 0.82 0.82 4323
weighted avg 0.82 0.82 0.82 4323
This code creates a confusion matrix to evaluate the performance of a Random Forest Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.
The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.
The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.
# create a Gradient Boosting Classifier
gbc = GradientBoostingClassifier()
# fit the model
gbc.fit(X_train, y_train)
# store predictions in y_pred_GB
y_pred_GB=gbc.predict(X_test)
# compute the accuracy on the test set
accuracyGB = accuracy_score(y_test, y_pred_GB)
# Print the name of the model and its accuracy on the test data
print('Gradient Boosting Accuracy: ',accuracyGB*100)
Gradient Boosting Accuracy: 81.74878556557947
This code creates a Gradient Boosting Classifier model using the GradientBoostingClassifier() function from the scikit-learn library. The fit() method is used to train the model on the training set, X_train and y_train.
The predict() method is then used to generate predictions, y_pred_GB, on the test set, X_test. The accuracy_score() function is used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyGB.
Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format. This allows for the evaluation of the Gradient Boosting Classifier model's performance on the prediction task at hand.
# create confusion matrix
cm = confusion_matrix(y_test, y_pred_GB)
# Plot confusion matrix
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Gradient Boosting Confusion Matrix')
# display the plot
plt.show()
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_GB)
# Create classification report
cr = classification_report(y_test, y_pred_GB)
# print confusion_matrix
print(cm)
# print classification_report
print(cr)
[[1208 203 7]
[ 172 1096 186]
[ 4 217 1230]]
precision recall f1-score support
SalaryA 0.87 0.85 0.86 1418
SalaryB 0.72 0.75 0.74 1454
SalaryC 0.86 0.85 0.86 1451
accuracy 0.82 4323
macro avg 0.82 0.82 0.82 4323
weighted avg 0.82 0.82 0.82 4323
This code creates a confusion matrix to evaluate the performance of a Gradient Boosting Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.
The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.
The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.
The resulting confusion matrix and classification report allow for the evaluation of the Gradient Boosting Classifier model's performance on the prediction task at hand, and can be used to compare its performance to other models in the analysis.
# create AdaBoostClassifier
abc = AdaBoostClassifier()
# fit the model
abc.fit(X_train, y_train)
# store predictions in y_pred_AB
y_pred_AB=abc.predict(X_test)
# compute the accuracy on the test set
accuracyAB = accuracy_score(y_test, y_pred_AB)
# Print the name of the model and its accuracy on the test data
print('AdaBoost Accuracy: ',accuracyAB*100)
AdaBoost Accuracy: 78.0707841776544
This code creates an AdaBoost Classifier model using the AdaBoostClassifier() function from the scikit-learn library. The fit() method is used to train the model on the training set, X_train and y_train.
The predict() method is then used to generate predictions, y_pred_AB, on the test set, X_test. The accuracy_score() function is used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracyAB.
Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format. This allows for the evaluation of the AdaBoost Classifier model's performance on the prediction task at hand.
By creating and training an AdaBoost Classifier, the code is exploring the use of an ensemble learning technique that combines multiple "weak" models to create a stronger predictor. AdaBoost works by iteratively adjusting the weights of misclassified observations to focus on those that are most difficult to predict. This can lead to improved accuracy and performance compared to using a single model.
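The boosting behaviour described above can be observed directly via staged_score(), which reports the ensemble's accuracy after each round. This sketch uses a synthetic dataset from make_classification, not the notebook's housing data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_informative=5, random_state=42)

# Each round fits a depth-1 tree ("weak learner") on reweighted samples
abc = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X, y)

# staged_score yields the training accuracy after each boosting round;
# accuracy typically improves as rounds are added
scores = list(abc.staged_score(X, y))
print(f'after 1 round: {scores[0]:.3f}, after 50 rounds: {scores[-1]:.3f}')
```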
# create confusion matrix
cm = confusion_matrix(y_test, y_pred_AB)
# Plot confusion matrix using heatmap
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('AdaBoost Confusion Matrix')
# display the plot
plt.show()
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_AB)
# Create classification report
cr = classification_report(y_test, y_pred_AB)
# print confusion_matrix
print(cm)
# print classification_report
print(cr)
[[1128 279 11]
[ 164 1087 203]
[ 8 283 1160]]
precision recall f1-score support
SalaryA 0.87 0.80 0.83 1418
SalaryB 0.66 0.75 0.70 1454
SalaryC 0.84 0.80 0.82 1451
accuracy 0.78 4323
macro avg 0.79 0.78 0.78 4323
weighted avg 0.79 0.78 0.78 4323
This code creates a confusion matrix to evaluate the performance of an AdaBoost Classifier model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.
The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.
The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.
The resulting confusion matrix and classification report allow for the evaluation of the AdaBoost Classifier model's performance on the prediction task at hand, and can be used to compare its performance to other models in the analysis.
By creating and evaluating the performance of an AdaBoost Classifier, the code is exploring the use of an ensemble learning technique that combines multiple weak models to create a stronger predictor. AdaBoost works by iteratively adjusting the weights of misclassified observations to focus on those that are most difficult to predict. This can lead to improved accuracy and performance compared to using a single model.
# create a Support Vector Classifier
svc = SVC()
# fit the model
svc.fit(X_train, y_train)
# store predictions in y_pred_SV
y_pred_SV=svc.predict(X_test)
# compute the accuracy on the test set
accuracySVC = accuracy_score(y_test, y_pred_SV)
# Print the name of the model and its accuracy on the test data
print('Support Vector Accuracy: ',accuracySVC*100)
Support Vector Accuracy: 79.15799213509138
This code creates a Support Vector Machine (SVM) model using the SVC() function from the scikit-learn library. The fit() method is used to train the model on the training set, X_train and y_train.
The predict() method is then used to generate predictions, y_pred_SV, on the test set, X_test. The accuracy_score() function is used to compute the accuracy of the model's predictions on the test set, and the resulting value is stored in the variable accuracySVC.
Finally, the print() function is used to display the name of the model and its accuracy on the test data in percentage format. This allows for the evaluation of the SVM model's performance on the prediction task at hand.
# create confusion matrix
cm = confusion_matrix(y_test, y_pred_SV)
# Plot confusion matrix using heatmap
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
# set the x-axis label
plt.xlabel('Predicted label')
# set the y-axis label
plt.ylabel('True label')
# set the title
plt.title('Support Vector Confusion Matrix')
# display the plot
plt.show()
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred_SV)
# Create classification report
cr = classification_report(y_test, y_pred_SV)
# print confusion_matrix
print(cm)
# print classification_report
print(cr)
[[1152 256 10]
[ 170 1087 197]
[ 3 265 1183]]
precision recall f1-score support
SalaryA 0.87 0.81 0.84 1418
SalaryB 0.68 0.75 0.71 1454
SalaryC 0.85 0.82 0.83 1451
accuracy 0.79 4323
macro avg 0.80 0.79 0.79 4323
weighted avg 0.80 0.79 0.79 4323
This code creates a confusion matrix to evaluate the performance of a Support Vector Machine (SVM) model on the test set. The confusion_matrix() function from the scikit-learn library is used to compute the confusion matrix, and the resulting matrix is stored in the variable cm.
The heatmap() function from the seaborn library is then used to create a heatmap visualization of the confusion matrix, with the annot=True parameter used to display the values of the matrix in each cell. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.
The code also creates a classification report using the classification_report() function from the scikit-learn library, which provides a detailed summary of the model's performance on each class in the target variable. The resulting confusion matrix and classification report provide insights into the model's performance and can be used to identify areas for improvement in the model.
The resulting confusion matrix and classification report allow for the evaluation of the SVM model's performance on the prediction task at hand, and can be used to compare its performance to other models in the analysis.
By creating and evaluating the performance of an SVM model, the code is exploring the use of a powerful and versatile algorithm that can be used for both classification and regression tasks. SVM works by finding the optimal hyperplane that separates the data points into their respective classes, with the goal of maximizing the margin between the hyperplane and the closest points. This can lead to improved accuracy and performance compared to other linear classification models.
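One practical consequence of the margin-based objective is that SVMs are sensitive to feature scale. The notebook's features were already standardized upstream, but when starting from raw columns the usual safeguard is a Pipeline with StandardScaler, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling is fitted on the training folds only, avoiding test-set leakage
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
print(f'scaled SVC accuracy: {model.score(X_test, y_test):.3f}')
```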
model_names = ['Log','DT', 'RF', 'GB', 'AB', 'SVC']
# Train and evaluate each model
accuracies = []
accuracies.append(accuracyLOG)
accuracies.append(accuracyDT)
accuracies.append(accuracyRF)
accuracies.append(accuracyGB)
accuracies.append(accuracyAB)
accuracies.append(accuracySVC)
# Create a dataframe to store the evaluation metrics
evaluation_df = pd.DataFrame({'Model': model_names,
'Accuracy': accuracies
})
# Print the evaluation metrics for each model
print(evaluation_df)
# Create a bar plot to compare the accuracy of the models
plt.bar(model_names, accuracies)
plt.title('Model Accuracy')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.show()
# Print the best model based on accuracy
best_model = evaluation_df.loc[evaluation_df['Accuracy'].idxmax(), 'Model']
print(f'Best model: {best_model}')
  Model  Accuracy
0   Log   0.71964
1    DT   0.75156
2    RF   0.81726
3    GB   0.81749
4    AB   0.78071
5   SVC   0.79158
Best model: GB
This code compares the classification models trained above: logistic regression, decision tree, random forest, gradient boosting, AdaBoost, and support vector machine (SVM).
The model_names list is created to store the names of each model for later use in the analysis.
The previously computed test accuracy of each model is appended to the accuracies list.
A dataframe called evaluation_df is created to store the evaluation metrics for each model, including the model name and accuracy. This dataframe is printed to the console using the print() function.
A bar plot is created using the bar() function from the matplotlib library to compare the accuracy of the models. Appropriate axis labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions.
The idxmax() function is used to find the index of the highest accuracy value in the evaluation_df dataframe, and the corresponding model name is printed to the console using the print() function. This allows us to identify the best-performing model based on accuracy.
# save the trained Random Forest regression model (rf, fitted earlier in the notebook) to disk
filename='Random_Forest_Model_Regression.joblib'
joblib.dump(rf,filename)
['Random_Forest_Model_Regression.joblib']
# load the saved model and predict the price of a sample house
loaded_model=joblib.load(filename)
Y_Pred=loaded_model.predict([[3,1,1180,1,0,7,1180,0,47.5,5650,24092]])
Y_Pred
array([4334920.])
import pickle
with open('Random_Forest_Model_Regression.pkl', 'wb') as f:
pickle.dump(rf, f)
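The save/load round trip can be verified on a tiny stand-in model (a sketch: the file name model_roundtrip.pkl and the toy regressor are illustrative, not part of the notebook). A model reloaded with pickle.load() reproduces the original's predictions exactly.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the notebook's fitted regression model
X = np.arange(20).reshape(-1, 1)
y = X.ravel() * 2.0
model = RandomForestRegressor(random_state=42).fit(X, y)

# Serialize, then deserialize the model
with open('model_roundtrip.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model_roundtrip.pkl', 'rb') as f:
    loaded = pickle.load(f)

# The reloaded model reproduces the original's predictions exactly
assert (loaded.predict(X) == model.predict(X)).all()
print('round-trip OK')
```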
df_copy[['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
'lat','sqft_lot15','population','price']]
| bedrooms | bathrooms | sqft_living | floors | view | grade | sqft_above | sqft_basement | lat | sqft_lot15 | population | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.39871 | -1.44752 | -0.97980 | -0.91544 | -0.30577 | -0.55883 | -0.73468 | -0.65869 | -0.35251 | -0.26072 | -0.59278 | 221900 |
| 1 | -0.39871 | 0.17556 | 0.53370 | 0.93644 | -0.30577 | -0.55883 | 0.46081 | 0.24531 | 1.16158 | -0.18788 | 0.57437 | 538000 |
| 2 | -1.47390 | -1.44752 | -1.42623 | -0.91544 | -0.30577 | -1.40955 | -1.22979 | -0.65869 | 1.28355 | -0.17239 | -0.92283 | 180000 |
| 3 | 0.67648 | 1.14941 | -0.13050 | -0.91544 | -0.30577 | -0.55883 | -0.89167 | 1.39791 | -0.28323 | -0.28453 | -1.43043 | 604000 |
| 4 | -0.39871 | -0.14905 | -0.43538 | -0.91544 | -0.30577 | 0.29189 | -0.13090 | -0.65869 | 0.40959 | -0.19286 | -0.44398 | 510000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | -0.39871 | 0.50018 | -0.59871 | 2.78832 | -0.30577 | 0.29189 | -0.31203 | -0.65869 | 1.00498 | -0.41238 | 1.36782 | 360000 |
| 21609 | 0.67648 | 0.50018 | 0.25059 | 0.93644 | -0.30577 | 0.29189 | 0.62987 | -0.65869 | -0.35612 | -0.20396 | -0.42834 | 400000 |
| 21610 | -1.47390 | -1.77214 | -1.15402 | 0.93644 | -0.30577 | -0.55883 | -0.92789 | -0.65869 | 0.24793 | -0.39414 | -0.34217 | 402101 |
| 21611 | -0.39871 | 0.50018 | -0.52249 | 0.93644 | -0.30577 | 0.29189 | -0.22750 | -0.65869 | -0.18436 | -0.42051 | -0.40867 | 400000 |
| 21612 | -1.47390 | -1.77214 | -1.15402 | 0.93644 | -0.30577 | -0.55883 | -0.92789 | -0.65869 | 0.24576 | -0.41795 | -0.34217 | 325000 |
21611 rows × 12 columns
# save the best classification model (gbc, Gradient Boosting) to disk
filenameC='Gradient_Boosting_Model_Classification.joblib'
joblib.dump(gbc,filenameC)
['Gradient_Boosting_Model_Classification.joblib']
loaded_model=joblib.load(filenameC)
Y_Pred=loaded_model.predict([[3,1,1180,1,0,7,1180,0,47.5,5650,24092]])
Y_Pred
array(['SalaryC'], dtype=object)
import pickle
with open('Random_Forest_Model_Classification.pkl', 'wb') as f:
    pickle.dump(rfc, f)
df_copy1[['bedrooms', 'bathrooms', 'sqft_living','floors','view','grade', 'sqft_above', 'sqft_basement',
'lat','sqft_lot15','population','price']]
| bedrooms | bathrooms | sqft_living | floors | view | grade | sqft_above | sqft_basement | lat | sqft_lot15 | population | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.39871 | -1.44752 | -0.97980 | -0.91544 | -0.30577 | -0.55883 | -0.73468 | -0.65869 | -0.35251 | -0.26072 | -0.59278 | SalaryA |
| 1 | -0.39871 | 0.17556 | 0.53370 | 0.93644 | -0.30577 | -0.55883 | 0.46081 | 0.24531 | 1.16158 | -0.18788 | 0.57437 | SalaryB |
| 2 | -1.47390 | -1.44752 | -1.42623 | -0.91544 | -0.30577 | -1.40955 | -1.22979 | -0.65869 | 1.28355 | -0.17239 | -0.92283 | SalaryA |
| 3 | 0.67648 | 1.14941 | -0.13050 | -0.91544 | -0.30577 | -0.55883 | -0.89167 | 1.39791 | -0.28323 | -0.28453 | -1.43043 | SalaryC |
| 4 | -0.39871 | -0.14905 | -0.43538 | -0.91544 | -0.30577 | 0.29189 | -0.13090 | -0.65869 | 0.40959 | -0.19286 | -0.44398 | SalaryB |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21608 | -0.39871 | 0.50018 | -0.59871 | 2.78832 | -0.30577 | 0.29189 | -0.31203 | -0.65869 | 1.00498 | -0.41238 | 1.36782 | SalaryA |
| 21609 | 0.67648 | 0.50018 | 0.25059 | 0.93644 | -0.30577 | 0.29189 | 0.62987 | -0.65869 | -0.35612 | -0.20396 | -0.42834 | SalaryB |
| 21610 | -1.47390 | -1.77214 | -1.15402 | 0.93644 | -0.30577 | -0.55883 | -0.92789 | -0.65869 | 0.24793 | -0.39414 | -0.34217 | SalaryB |
| 21611 | -0.39871 | 0.50018 | -0.52249 | 0.93644 | -0.30577 | 0.29189 | -0.22750 | -0.65869 | -0.18436 | -0.42051 | -0.40867 | SalaryB |
| 21612 | -1.47390 | -1.77214 | -1.15402 | 0.93644 | -0.30577 | -0.55883 | -0.92789 | -0.65869 | 0.24576 | -0.41795 | -0.34217 | SalaryA |
21611 rows × 12 columns
The provided code shows a sample of a housing dataset containing various features such as price, number of bedrooms and bathrooms, living area, lot size, floors, waterfront, view, grade, year built, and others. The analysis of this dataset includes data exploration, manipulation, visualization, and machine learning techniques to predict the housing prices and classify the properties.
The machine learning section includes regression and classification models, which aim to predict the housing prices or classify the properties based on their features. The exploratory data analysis includes analyzing and visualizing the relationships between the features and the target variable, identifying outliers, and understanding the distribution of the data.
The insights gained from this analysis could be useful for various stakeholders, such as home buyers, real estate agents, and property developers. The developed models provide a way to accurately predict the housing prices or classify the properties based on certain features, which could assist in making informed decisions about buying or selling properties. Overall, this analysis provides a valuable contribution to the field of real estate by identifying the factors that influence housing prices and developing accurate prediction models.
Although the analysis of the housing dataset has provided valuable insights and developed accurate prediction models, there are still areas for further research and improvement. Some potential future work includes:
Incorporating additional features: The dataset used in this analysis includes various features, but there may be other features that could influence housing prices, such as crime rates, proximity to schools or public transportation, and nearby amenities. Incorporating these features could improve the accuracy of the prediction models.
Improving the models' performance: Although the developed models have high accuracy, there is still room for improvement. Techniques such as ensemble learning, feature selection, and hyperparameter tuning could be used to further enhance the models' performance.
Evaluating the models' generalizability: The developed models were trained and tested on a specific dataset, so it is important to evaluate their generalizability to other datasets or real-world scenarios. Cross-validation techniques and testing on new datasets could be used to assess the models' generalizability.
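As a sketch of the cross-validation check suggested above, cross_val_score averages the score over k held-out folds instead of relying on a single 80/20 split (synthetic data here; in the notebook one would pass X and y from df_copy1):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

# 5-fold cross-validation: each fold serves once as the held-out test set
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(f'fold accuracies: {scores.round(3)}, mean: {scores.mean():.3f}')
```

A large spread between fold scores would signal that the single-split accuracies reported above are fragile.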
Exploring interpretability: While the developed models have high accuracy, they may lack interpretability, meaning it may not be clear which features are driving the predictions. Exploring interpretability techniques such as feature importance and partial dependence plots could provide insights into the models' decision-making processes.
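The interpretability check suggested above can be sketched with scikit-learn's permutation_importance, which measures how much shuffling each feature degrades the model's score (the synthetic data and column names here are illustrative; in the notebook one would pass the fitted model and the test split):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = pd.DataFrame({'signal': rng.normal(size=600),
                  'noise': rng.normal(size=600)})
y = (X['signal'] > 0).astype(int)  # label depends only on 'signal'

model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each column 10 times and record the mean drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f'{name}: {imp:.3f}')  # 'signal' should score far above 'noise'
```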
Overall, these potential future work areas could further improve the accuracy and applicability of the developed models, providing valuable insights for real estate stakeholders and contributing to the field of real estate research.